logic implementing the aforementioned methodology are also disclosed.

1. Field of the Invention

The present invention relates to the field of digital data processor design,
specifically to the control and operation of the instruction pipeline of the processor and
structures associated therewith.

2. Description of Related Technology

RISC (or reduced instruction set computer) processors are well known in the
computing arts. RISC processors generally have the fundamental characteristic of utilizing
a substantially reduced instruction set as compared to non-RISC (commonly known as
"CISC") processors. Typically, RISC processor machine instructions are not all
microcoded, but rather may be executed immediately without decoding, thereby affording
significant economies in terms of processing speed. This "streamlined" instruction handling
capability furthermore allows greater simplicity in the design of the processor (as compared
to non-RISC devices), thereby allowing smaller silicon and reduced cost of fabrication.

RISC processors are also typically characterized by (i) load/store memory
architecture (i.e., only the load and store instructions have access to memory; other
instructions operate via internal registers within the processor); (ii) unity of processor and
compiler; and (iii) pipelining.

Despite their many advantages, RISC processors may be prone to significant delays
or stalls within their pipelines. These delays stem from a variety of causes, including the
design and operation of the instruction set of the processor (e.g., the use of multi-word
and/or "breakpoint" instructions within the processor's instruction set), the use of
non-optimized bypass logic for operand routing during the execution of certain types of
instructions, and the non-optimized integration (or lack of integration) of the data cache
within the pipeline. Furthermore, lack of parallelism in the operation of the pipeline can
result in critical path delays which reduce performance. These aspects are described below
in greater detail.

Multi-word Instructions

Many RISC processors offer programmers the opportunity to use instructions that
span multiple words. Some multi-word instructions permit a greater number of operands
and addressing modes while others enable a wider range of immediate data values. For
multi-word immediate data, the pipelined execution of instructions has some inherent
limitations including, inter alia, the potential for an instruction containing long immediate
data to be impacted by a pipeline stall before the long immediate data has been completely
fetched from memory. This stalling of an incompletely fetched piece of data has several
ramifications, one of which is that the otherwise executable instruction may be stalled





WO 01/69378    PCT/US01/07360



before it is necessary to do so. This leads to increased execution time and overhead within
the processor. Stalling of the processor due to unavailability of data causes the processor to
insert one or more additional clock cycles. During these clock cycles the processor cannot,
as a general rule, advance the execution of additional instructions. This is because the
incomplete data acts as a blocking function: execution remains pending until the data
becomes available. For example, consider a simple add
instruction that adds two quantities and places the result in a third location. Provided that
both pieces of data are available when needed, the execution completes in the normal
number of cycles. Now consider the case in which one of the pieces of data is not available.
In this case completion of the add instruction must stop until the data becomes available.
The consequence of this stalling action is to possibly delay the completion by more than the
minimum necessary time.
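
The add example above may be sketched in a few lines of Python. This is an illustrative model only: the function name add_completion, the cycle numbering, and the one-cycle execution latency are assumptions made for the sketch and are not taken from any particular processor.

```python
def add_completion(issue_cycle, operand_ready_cycles, latency=1):
    """Return (completion_cycle, stall_cycles) for one add instruction.

    issue_cycle: cycle at which the add could otherwise begin executing.
    operand_ready_cycles: cycle at which each source operand becomes available.
    latency: execution cycles once both operands are present (assumed value).
    """
    # Execution cannot begin until the last operand has arrived.
    start_cycle = max(issue_cycle, max(operand_ready_cycles))
    # Every cycle spent waiting is an inserted stall cycle.
    stall_cycles = start_cycle - issue_cycle
    return start_cycle + latency, stall_cycles


# Both operands are ready before issue, so no stall is inserted.
print(add_completion(3, (1, 2)))
# One operand arrives late, so the add is held until that operand is present.
print(add_completion(3, (1, 6)))
```

In this simplified model the stall lasts exactly as long as the data is missing; the concern raised above is that a real pipeline may hold the instruction longer than this minimum.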

Breakpoint Instructions 

One of the useful RISC instructions is the "breakpoint" instruction. Chiefly for use
during the design and implementation phases of the processor (e.g., software/hardware
integration and software debug), the breakpoint instruction causes the CPU to stop
execution of any further instructions without some type of direct intervention, typically at
the request of an operator. Once the breakpoint instruction has been executed by the
pipeline, the CPU stops further processing until it receives some external signal such as an
interrupt which signals to the CPU that execution should resume. Breakpoint instructions
typically replace or displace some other executable instruction which is subsequently
executed upon resumption of the normal execution state of the CPU.



Execution time is critical for many applications, hence minimizing so-called critical
paths in the decode phase of a multi-stage pipelined CPU is an important consideration.
Since the breakpoint instruction is a performance critical instruction during normal
execution, the prior art practice has been to perform the breakpoint instruction decode and
execution in the first pipeline stage of the typical four-stage pipeline (i.e., fetch, decode,
execution, and write-back stages). Fig. 1 illustrates a typical prior art breakpoint instruction
decode architecture. As shown in Fig. 1, the prior art stage 1 configuration 100 comprises
the stage 1 latch 102, instruction cache 104, instruction decode logic 106, and instruction
request address selection logic 108, the latter providing input to the stage 2 latches 110.
The current program counter (pc) address value is input 112 back to the stage 1 latch 102
for subsequent instruction fetch. Instruction decode, including decode of any breakpoint
instructions, occurs within the instruction decode logic 106. However, such decoding in the
first stage places unnecessary demands on the speed path of ordinary instruction handling.
Ordinary instructions are decoded in stage 2 (decode) of the pipeline. This stage 1 decode
of the breakpoint instruction places minimum decode requirements on the first stage that
are longer than would otherwise be required without having breakpoint instruction decode
occur in the first stage. This result is due largely to the fact that the breakpoint instruction
requires time to set up and disable a variety of functional blocks. For example, in the
ARC™ extensible RISC processor architecture manufactured by the Assignee hereof, functional
blocks may include optional multiply-accumulate hardware, Viterbi acceleration units, and
other specific hardware accelerators in addition to standard functional blocks such as an
arithmetic-logic unit, address generator units, interrupt processors and peripheral devices.
Setup for each of these units will depend on the exact nature of the unit. For example, a
single cycle unit for which state information is not required for the unit to function may
require no specialized setup. By contrast, an operation that requires multiple pipeline
stages to complete will require assertion of signals within the pipeline to ensure that any
transitory results are safely stored in appropriate registers. Whereas other instructions are
simply fetched in stage 1, the breakpoint instruction requires control signals to be generated
to most elements of the core. This results in longer netlists and hence greater delays.



Bypass Logic

Bypass logic is sometimes used in RISC processors (such as the aforementioned
ARC core) to provide additional flexibility for routing operands to a variety of input
options. For example, as illustrated in Fig. 2, outputs of various functional units (such as
the first and second execute stage result selection logic) are routed back to the input of
another functional unit; e.g., decode stage bypass operand selection logic. This bypass
arrangement eliminates a number of load and store operations, reduces the number of
temporary variable locations needed during execution, and stages data in the proper
location for iterative operations. Such bypass arrangements permit software to exploit the
nature of pipelined instruction execution. Using the prior art bypass circuitry of Fig. 2, a
program can be configured to perform pipelined iterative algorithms. One such algorithm is
the sum-of-products for a finite series. Since the processor performs scalar operations, each
stage of the summation is achieved by a single multiply followed by a single addition of the
result to a sum. This principle is illustrated by the following operation:

Sum = 0
For I = 1 to n do
    Sum = Sum + (a(I) * b(I));



In a commonly used prior art CPU scheme, the value of Sum is stored in a dedicated
general purpose register or in a memory location. Each iteration requires a memory fetch or
register access operation to calculate the next summation in the series. Since the CPU can
only perform a limited number of memory or register accesses per cycle, this form may
execute relatively slowly in comparison to a single cycle ideal for the sum-of-products
operation (i.e., where the sum-of-products is calculated entirely within a single instruction
cycle), or even in comparison to a non-single cycle operation where memory fetches or
register accesses are not required in each iteration of the operation.
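
The cost of the prior art scheme may be sketched as follows. The sketch is a counting model only: it assumes one read and one write of the stored Sum per iteration, and the names sum_of_products_prior_art and regs are hypothetical.

```python
def sum_of_products_prior_art(a, b):
    """Accumulate a[i] * b[i] with the running sum held in a storage location.

    Returns (result, accesses), where accesses counts reads and writes of the
    stored sum. One read plus one write per iteration is an assumption of
    this sketch, not a measured figure.
    """
    regs = {"sum": 0}  # stands in for a general purpose register or memory word
    accesses = 0
    for x, y in zip(a, b):
        current = regs["sum"]          # fetch the running sum
        accesses += 1
        regs["sum"] = current + x * y  # single multiply, single add, store back
        accesses += 1
    return regs["sum"], accesses


result, accesses = sum_of_products_prior_art([1, 2, 3], [4, 5, 6])
print(result, accesses)
```

The access count grows linearly with the length of the series, which is the overhead the passage above identifies.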

Data Cache Integration 

For a number of instruction types within the instruction set of the typical RISC
processor, there is no requirement for or need to stall the pipeline. However, some other
instruction types will require a stall to occur. The ordinary prior art method for integrating a
data cache with a processor core relies on a technique that assumes that the worst case
evaluation for stalls must be applied to even those cases where the most extreme case
specifically does not apply. This "worst case" approach leads to an increased number of
pipeline stalls (and/or increased duration for each stall) as well as increased overhead,
thereby resulting ultimately in increased execution time and reduced pipeline speed.

Fig. 3 is a logical block diagram illustrating typical prior art data cache integration.
It assumes the cache request originates directly from the pipeline rather than the load store
queue. Note the presence of the bypass operand selection logic 302, the control logic hazard
detection logic 304, and the multi-level latch control logic 306 structures within the second
(E2) execution stage.

Fig. 3a illustrates the operation of the typical prior art data cache structure of Fig. 3
in the context of an exemplary load (Ld), move (Mov), and add (Add) instruction sequence.
The exemplary instruction sequence is as follows:

Ld r0,[r1,4]
Mov r5,r4 ;independent of the load
Add r8,r0,r9 ;dependent on first load

First, in step 350, the Load (Ld) is requested. The Mov is then requested in step 352.
In step 354, the Add is requested. The Ld operation begins in step 356. Next, the Mov
operation begins in step 358. The cache misses. Accordingly, the Add is then prevented
from moving.

In step 360, the Mov continues to flow down the pipeline. In step 362, the Add
moves down the pipeline in response to the Load operation completing. The pipeline then
flows with no stalls (steps 364, 366, and 368).

Note that in the foregoing example, the Add instruction is prevented from moving
from the decode stage of the pipeline to the first execute stage (E1) for several cycles. This
negatively impacts pipeline performance by slowing the execution of the Add instruction.

Pipeline Parallelism 

Often in prior art processor systems, the instruction cache pipeline integration is far
from optimal. This results in many cases from the core effectively making the cache
pipeline stages 0 and 1 dependent on each other. This can be seen diagrammatically in Fig.
4, wherein the pipeline control 402, instruction decode 404, nextpc selection 406, and
instruction cache address selection 408, are disposed in the instruction fetch stage 412 of
the pipeline. The critical path of this non-optimized pipeline 400 allows the control path of
the processor to be influenced by a slow signal/data path. Accordingly the slow data path
must be removed if the performance of the core is to be improved. For example, in most
core build instances, the prior art approach means the instruction fetch pipeline stage has an
unequal duration to the other pipeline stages, and in general becomes the limiting factor in
processor performance since it limits the minimum clock period.
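
The relationship between stage duration and clock period may be sketched as follows. The stage delays are hypothetical figures chosen only for illustration; the point shown is that the slowest stage alone sets the minimum clock period.

```python
def min_clock_period(stage_delays_ns):
    """The clock period can be no shorter than the slowest stage's delay."""
    return max(stage_delays_ns.values())


def limiting_stage(stage_delays_ns):
    """Name of the stage that sets the minimum clock period."""
    return max(stage_delays_ns, key=stage_delays_ns.get)


# Hypothetical per-stage delays (ns) with an over-long instruction fetch stage.
delays = {"fetch": 6.0, "decode": 4.0, "execute": 4.5, "writeback": 3.0}
print(limiting_stage(delays), min_clock_period(delays))
```

Under these assumed figures, shortening the fetch stage to match the other stages would lower the minimum clock period, whereas speeding up any other stage would not.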

Fig. 4a is a block diagram of components and instruction flow within the
non-optimized processor design of Fig. 4. As illustrated in Fig. 4a, the slow signal/data path
influences the control path for the pipeline 400.

Based on the foregoing, there is a need for improved methods and apparatus for
enhancing pipeline operation, including reducing stalls and delays in CPU execution.
Ideally, several aspects of pipeline operation would be optimized by such improved method
and apparatus, including (i) handling of multi-word instructions and immediate data, such
as in the calculation of such scalar quantities with a reduced number of memory fetches or
register accesses; (ii) use of breakpoint instructions; (iii) bypass logic arrangement; (iv)
data cache operation/integration; and (v) increased parallelism within the pipeline.
Additionally, such improved apparatus and method would be readily adapted to existing
processor designs and architectures, thereby minimizing the work necessary to integrate
such functionality, as well as the impact on the processor design as a whole.

Summary of the Invention

The foregoing needs are satisfied by providing an improved method and apparatus for 
enhanced performance in a pipelined processor. 

In a first aspect of the invention, a method and apparatus for avoiding the stalling of 
long immediate data instructions, so that processor performance is maximized, is disclosed. 

The invention prevents the host from halting the core before an instruction with long
immediate values in the decode stage of the pipeline has merged, thereby advantageously
making the instructions containing long immediate data "non-stallable" on the boundary
between the instruction opcode and the immediate data. Consequently the instruction
containing long immediate data is treated as if the CPU were wider in word width for that
instruction only. The method generally comprises providing a first instruction word; providing
a second instruction word; and defining a single large instruction word comprising the first
and second instruction words; wherein the single large instruction word is processed as a
single instruction within the processor's pipeline, thereby reducing pipeline delays.
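
The idea of treating two instruction words as one larger word may be sketched as follows. This is a minimal sketch under stated assumptions: the 32-bit word width, the placement of the opcode word in the upper half, and the names merge_long_immediate and halt_permitted are all hypothetical and are not drawn from the disclosed merge logic.

```python
WORD_BITS = 32                     # assumed instruction word width
WORD_MASK = (1 << WORD_BITS) - 1


def merge_long_immediate(opcode_word, limm_word):
    """Form one double-width instruction from an opcode word and its immediate.

    Placing the opcode word in the upper half is an assumption of this sketch.
    """
    return ((opcode_word & WORD_MASK) << WORD_BITS) | (limm_word & WORD_MASK)


def halt_permitted(uses_long_immediate, merged):
    """A halt request is honoured only when it would not split the pair."""
    return (not uses_long_immediate) or merged


wide = merge_long_immediate(0x12345678, 0x9ABCDEF0)
print(hex(wide))
print(halt_permitted(True, False), halt_permitted(True, True))
```

The second function captures the "non-stallable" boundary in its simplest form: while an instruction that needs long immediate data has not yet been merged with that data, a halt is deferred.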

In a second aspect of the invention, an improved apparatus for decoding and executing
breakpoint instructions, so that processor pipeline performance is maximized, is disclosed. In
one exemplary embodiment, the apparatus comprises a pipeline arrangement with instruction
decode logic operatively located within the second stage (e.g., decode stage) of the pipeline,
thereby facilitating breakpoint instruction decode in the second stage versus the first stage as
in prior art systems. Such decode in the second stage removes several critical "blockages"
within the pipeline, and enhances execution speed by increasing parallelism therein.

In a third aspect of the invention, an improved method for decoding and executing 
breakpoint instructions, so that processor pipeline performance is maximized, is disclosed. 
Generally, the method comprises providing a pipeline having at least first, second, and third 
stages; providing a breakpoint instruction word, the breakpoint instruction word resulting in a 

stall of the pipeline when executed; inserting the breakpoint instruction word into the first
stage of the pipeline; and delaying decode of the breakpoint instruction word until the second
stage of the pipeline. In one exemplary embodiment, the pipeline is a four stage pipeline
having fetch, decode, execution, and write-back stages, and decode of the breakpoint
instruction is delayed until the decode stage of the processor. Additionally, to support the
decoding of the breakpoint instruction in the decode stage, the method further comprises
changing the program counter (pc) from the current value to a breakpoint pc value.

In a fourth aspect of the invention, an improved method of debugging a processor
design is disclosed. The method generally comprises providing a processor hardware design
having a multi-stage pipeline; providing an instruction set including at least one breakpoint
instruction adapted for use with the processor hardware design; running at least a portion of
the instruction set (including the breakpoint instruction) on the processor design during
debug; decoding the at least one breakpoint instruction at the second stage of the pipeline;
changing the program counter (pc) from the current value to a breakpoint pc value;
executing the breakpoint instruction in order to halt processor operation; and debugging the
instruction set or hardware/instruction set integration while the processor is halted.

In a fifth aspect of the invention, an apparatus for bypassing various components
and registers within a processor so as to maximize pipeline performance is disclosed. In one
embodiment, the apparatus comprises an improved logical arrangement employing a
special multi-function register having a selectable "bypass mode"; when in bypass mode,
the multi-function register is used to retain the result of a multi-cycle scalar operation (e.g.,
summation in a sum-of-products calculation), and present this result as a value to be selected
from by a subsequent instruction. In this fashion, memory accesses to obtain such summation
are substantially obviated, and the pipeline accordingly operates at a higher speed due to
elimination of the delays associated with the obviated memory accesses.

In a sixth aspect of the invention, a method for bypassing various components and
registers within a processor so as to maximize processor performance is disclosed. In one
embodiment, the method comprises providing a multi-function register; defining a bypass
mode for the register, wherein the register maintains the result of a multi-cycle scalar
operation therein during such bypass mode; performing a scalar operation a first time;
storing the result of the operation in the register in bypass mode; obtaining the result of the
first operation directly from the register; and performing a scalar operation a second time
using the result of the first operation obtained from the register.
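
The steps of this method may be sketched in software as follows. The class MultiFunctionRegister and its members are hypothetical names chosen for the sketch; it models only the behaviour described above (retaining a result while in bypass mode and supplying it directly to the next operation) and is not the disclosed logic.

```python
class MultiFunctionRegister:
    """Toy model of a register with a selectable bypass mode."""

    def __init__(self):
        self.bypass_mode = False
        self.value = 0

    def capture(self, result):
        # In bypass mode the register retains the functional unit's result.
        if self.bypass_mode:
            self.value = result


def sum_of_products_bypass(a, b):
    """Return (result, memory_accesses) with the sum fed back via the register."""
    reg = MultiFunctionRegister()
    reg.bypass_mode = True
    memory_accesses = 0  # no memory access is made to obtain the running sum
    for x, y in zip(a, b):
        # The previous result is taken directly from the register as an operand.
        reg.capture(reg.value + x * y)
    return reg.value, memory_accesses


print(sum_of_products_bypass([1, 2, 3], [4, 5, 6]))
```

Each iteration obtains the prior result from the register rather than from memory, so the count of memory accesses made for the running sum stays at zero in this model.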

In a seventh aspect of the invention, improved methods for increasing pipeline
performance and efficiency by decoupling certain signals, and allowing an existing pipeline
configuration to reveal more parallelism, are disclosed. The dataword fetch (e.g., ifetch)
signal, which indicates the need to fetch instruction opcode/data from memory at the location
being clocked into the program counter (pc) at the end of the current cycle, is made
independent of the qualifying (validity) signal (e.g., ivalid). Additionally, the next program
counter value signal (e.g., next_pc) is made independent of the data word supplied by the
memory controller (e.g., p1iw) and ivalid. The hazard detection logic and control logic of the
pipeline are further made independent of ivalid; i.e., the stage 1, stage 2, and stage 3 enables
(en1, en2, en3) are decoupled from the ivalid (and p1iw) signals, thereby decoupling pipeline
movement. So-called "structural stalls" are further utilized when a slow functional unit, or
operand fetch in the case of the xy memory extension, generates the next program counter
signal (next_pc). The jump instruction of the processor instruction set is also moved from
stage 2 to 3, independent of ivalid. In this case, the jump address is held if the delay slot
misses the cache and link. Additionally, delay slot instructions are not separated from their
associated jump instruction.

In an eighth aspect of the invention, an improved data cache apparatus useful within a
pipelined processor is disclosed. The apparatus generally comprises logic which allows the
pipeline to advance one stage ahead of the cache. Furthermore, rather than assuming that the
pipeline will need to be stalled under all circumstances as in prior art pipeline control logic, the
apparatus of the present invention allows the pipeline to move ahead of the cache, and only
stalls when a required data word is not provided (or other such condition necessitating a stall). Such
conditional "latent" stalls enhance pipeline performance over the prior art configurations by
eliminating conditions where stalls are unnecessarily invoked. In one exemplary embodiment,
the pipelined processor comprises an extensible RISC-based processor, and the logic
comprises (i) bypass operand selection logic disposed in the execution stage of the pipeline,
and (ii) a multi-function register architecture.

In a ninth aspect of the invention, an improved method of reducing pipeline delays due
to stalling using "latent" stalls is disclosed. The method generally comprises providing a
processor having an instruction set and multistage pipeline; adapting the processor pipeline to
move at least one stage ahead of the data cache, thereby assuming a data cache hit; detecting
the presence of at least one required data word; and stalling the pipeline only when the
required data word is not present.
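
The difference between a worst-case stall policy and a "latent" stall policy may be sketched as follows. This is a simplified counting model, not the disclosed control logic: each slot is described by two assumed flags (whether the instruction needs the outstanding data word, and whether that word is present), and the function name count_stalls is hypothetical.

```python
def count_stalls(slots, policy):
    """Count stalled slots under a given stall policy.

    slots: list of (needs_data_word, data_word_present) pairs, one per
           instruction attempting to advance (an assumed abstraction).
    policy: "worst_case" stalls whenever the data word is absent;
            "latent" stalls only if the instruction actually needs it.
    """
    stalls = 0
    for needs_data, present in slots:
        if policy == "worst_case":
            stall = not present
        elif policy == "latent":
            stall = needs_data and not present
        else:
            raise ValueError("unknown policy: " + policy)
        stalls += int(stall)
    return stalls


# An independent instruction, then a dependent one before and after the data arrives.
slots = [(False, False), (True, False), (True, True)]
print(count_stalls(slots, "worst_case"), count_stalls(slots, "latent"))
```

In this model the independent instruction is held under the worst-case policy but proceeds under the latent policy, which is the saving the method above is directed to.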

In a tenth aspect of the invention, an improved processor architecture utilizing one
or more of the foregoing improvements including "atomic" instruction words,
improved bypass logic, delayed breakpoint instruction decode, improved data cache
architecture, and pipeline "decoupling" enhancements, is disclosed. In one exemplary
embodiment, the processor comprises a reduced instruction set computer (RISC) having a
four stage pipeline comprising instruction fetch, decode, execute, and writeback stages, and
"latent stall" data cache architecture which allows the pipeline to advance one stage ahead
of the cache. In another embodiment, the processor further includes an instruction set
comprising at least one breakpoint instruction, the decoding of the breakpoint instruction
being accomplished within stage 2 of the pipeline. The processor is also optionally
configured with a multi-function register in a bypass configuration such that the result of
one iteration of an iterative calculation is provided directly as an operand for subsequent
iterations.

Brief Description of the Drawings 
Fig. 1 is a functional block diagram of a prior art pipelined processor breakpoint
instruction decode architecture (stage 1) illustrating the relationship between the instruction
cache, instruction decode logic, and instruction request address selection logic.

Fig. 2 is a block diagram of a prior art processor bypass logic architecture illustrating
the relationship of the bypass logic to the single- and multi-cycle functional units and
registers.

Fig. 3 is a functional block diagram of a prior art pipelined processor data cache
architecture illustrating the relationship between the data cache and associated execution
stage logic.

Fig. 3a is a graphical representation of pipeline movement within a typical prior art
processor pipeline architecture.

Fig. 4 is a block diagram illustrating a typical non-optimized prior art processor
pipeline architecture and the relationship between various instructions and functional
entities within the pipeline logic.

Fig. 4a is a block diagram of components and instruction flow within the
non-optimized prior art processor design of Fig. 4.

Fig. 5 is a logical flow diagram illustrating one embodiment of the long instruction
word long immediate (limm) merge logic of the invention.

Fig. 6 is a block diagram of one embodiment of the modified pipeline architecture
and related functionalities according to the present invention, illustrating the enhanced path
independence and parallelism thereof.

Fig. 7 is a functional block diagram of one exemplary embodiment of the pipeline
logic arrangement of the invention, illustrating the decoupling of the ivalid and p1iw signals
from the various other components of the pipeline logic.

Fig. 8 is a functional block diagram of one embodiment of the breakpoint instruction
decode architecture (stage 1) of the present invention, illustrating the relationship between
the instruction cache, instruction decode logic, and instruction request address selection
logic.

Fig. 8a is a graphical representation of the movement of the pipeline of an
exemplary processor incorporating the improved breakpoint instruction logic of the
invention, wherein a breakpoint instruction is located within a delay slot.

Fig. 8b is a graphical representation of pipeline movement wherein a breakpoint
instruction is normally handled within the pipeline when a delay slot is not present.

Fig. 8c is a graphical representation of pipeline movement during stalled jump and
branch operation according to the present invention.

Fig. 9 is a block diagram of one embodiment of the improved bypass logic
architecture of the present invention, illustrating the use of a multi-function register within
the execute stage of the pipeline logic between the bypass operand selection logic and the
single- and multi-cycle functional units.

Fig. 10 is a logical flow diagram illustrating one embodiment of the method of
utilizing bypass logic to maximize processor performance during iterative calculations
(such as sum-of-products) according to the invention.

Fig. 11 is a block diagram illustrating one exemplary embodiment of the modified
data cache structure of the present invention.

Fig. 11a is a graphical representation of pipeline movement in an exemplary
processor incorporating the improved data cache integration according to the present
invention.

Fig. 12 is a logical flow diagram illustrating one exemplary embodiment of the
method of enhancing the performance of a pipelined processor design according to the
invention.

Fig. 13 is a logical flow diagram illustrating the generalized methodology of
synthesizing processor logic using a hardware description language (HDL), the synthesized
logic incorporating the pipeline performance enhancements of the present invention.

Fig. 14 is a block diagram of an exemplary RISC pipelined processor design
incorporating various of the pipeline performance enhancements of the present invention.

Fig. 15 is a functional block diagram of one exemplary embodiment of a computer
system useful for synthesizing gate logic implementing the aforementioned pipeline
performance enhancements within a digital processor device.

Detailed Description

Reference is now made to the drawings wherein like numerals refer to like parts
throughout.

As used herein, the term "processor" is meant to include any integrated circuit or
other electronic device capable of performing an operation on at least one instruction word
including, without limitation, reduced instruction set core (RISC) processors such as the
ARC™ user-configurable core manufactured by the Assignee hereof, central processing
units (CPUs), and digital signal processors (DSPs). The hardware of such devices may be
integrated onto a single piece of silicon ("die"), or distributed among two or more die.
Furthermore, various functional aspects of the processor may be implemented solely as
software or firmware associated with the processor.

Additionally, it will be recognized by those of ordinary skill in the art that the term
"stage" as used herein refers to various successive stages within a pipelined processor; i.e.,
stage 1 refers to the first pipelined stage, stage 2 to the second pipelined stage, and so forth.

It is also noted that while the following description is cast in terms of VHSIC
hardware description language (VHDL), other hardware description languages such as
Verilog® may be used to describe various embodiments of the invention with equal
success. Furthermore, while an exemplary Synopsys® synthesis engine such as the Design
Compiler 2000.05 (DC00) is used to synthesize the various embodiments set forth herein,
other synthesis engines such as Buildgates® available from, inter alia, Cadence Design
Systems, Inc., may be used. IEEE std. 1076.3-1997, IEEE Standard VHDL Synthesis
Packages, describes an industry-accepted language for specifying a Hardware Definition
Language-based design and the synthesis capabilities that may be expected to be available
to one of ordinary skill in the art.

Lastly, it is noted that as used in this disclosure, the terms "breakpoint" and
"breakpoint instruction" refer generally to that class of processor instructions which result in
an interrupt or halting of at least a portion of the execution or processing of instructions
within the pipeline or associated logic units of a digital processor. As discussed in greater
detail below, one such instruction comprises the "Brkx" class of instructions associated with
the ARC™ extensible RISC processor previously referenced; however, it will be
recognized that any number of different instructions meeting the aforementioned criteria
may benefit from the methodology of the present invention.

It will be noted that while the various methodologies of the invention are described
herein in terms of a particular sequence of steps, such descriptions are only exemplary of
the broader methods. Accordingly, the sequence of performance of such steps may in many
cases be permuted, and/or additional steps added. Other steps may be optional. All such
variations are considered to fall within the scope of the claims appended hereto.

Overview 

Pipelined CPU instruction decode and execution is a common method of providing
performance enhancements for CPU designs. Many CPU designs offer programmers the
opportunity to use instructions that span multiple words. Some multi-word instructions
permit a greater number of operands and addressing modes, while others enable a wider
range of immediate data values. For multi-word immediate data, pipelined execution of
instructions has some built-in limitations. As previously discussed, one of these limitations
is the potential for an instruction containing long immediate data to be impacted by a
pipeline stall before the long immediate data has been completely fetched from memory.
This stalling of an incompletely fetched piece of data has several ramifications, one of
which is that the otherwise executable instruction may be stalled before it is necessary. This
leads to increased execution time and overhead, thereby reducing processor performance.

The present invention provides, inter alia, a way to avoid the stalling of long
immediate data instructions so that performance is maximized. The invention further
eliminates a critical path delay in a typical pipelined CPU by treating certain multi-word long
immediate data instructions as a larger or "atomic" multi-word oversized instruction. These
larger instructions are multi-word format instructions such as those employing long immediate
data. Typical instruction types for the oversized instructions disclosed herein include "load
immediate" and "jump" type instructions.

Processor instruction execution time is critical for many applications; therefore,
minimizing so-called "critical paths" within the decode phase of a multi-stage pipelined
processor is also an important consideration. One approach to improving performance of the
CPU in all cases is removing the speed path limitations. The present invention accomplishes
removal of such path limitations by, inter alia, reducing the number of critical path delays in
the control logic associated with instruction fetch and decode, including decode of breakpoint
instructions used during processes such as debug. By moving the breakpoint instruction
decode from stage 1 (as in the prior art) to stage 2, the present invention eliminates the speed
path constraint imposed by the breakpoint instruction; stage 1 instruction word decoding is
advantageously removed from the critical path.

Delays in the pipeline are further reduced using the methods of the present invention
through modifications to the pipeline hazard detection and control logic (and register
structure), which effectively reveal more parallelism in the pipeline. Pipelining of operations
which span multiple cycles is also utilized to increase parallelism.
The present invention further advantageously permits the data cache to be integrated
into the processor core in a manner that allows the pipeline to advance one stage ahead of the
data cache. In the particular case of the aforementioned ARC™ extensible RISC processor
manufactured by the Assignee hereof, since the valid signal for returning loads (i.e., "ldvalid")
does not necessarily influence pipeline movement, it can be assumed that the data cache will
"hit" (i.e., contain the appropriate data value when accessed). Such cache hit allows the
pipeline to move on to conduct further processing. If this assumption is wrong, and the
requested data word is needed by an execution unit in stage 3, the pipeline can then be stalled.
This "latent stall" approach improves pipeline performance significantly, since stalls within
the pipeline due to cache "misses" are invoked only on an as-needed basis.
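
The "latent stall" policy described above can be illustrated with a minimal Python sketch. It is
not the HDL of Appendix I: the Cycle record, every field name other than ldvalid, and the
four-cycle trace are assumptions made purely for illustration. The sketch compares a policy that
stalls whenever a load has not yet returned with one that stalls only when stage 3 actually needs
the load data.

```python
from dataclasses import dataclass


@dataclass
class Cycle:
    """One pipeline cycle in a hypothetical, fixed trace (illustrative only)."""
    load_pending: bool       # a load has been issued and has not yet retired
    ldvalid: bool            # the data cache reports the returning load data as valid
    stage3_needs_data: bool  # an execution unit in stage 3 consumes the load result


def eager_stall(c: Cycle) -> bool:
    # Stall as soon as an outstanding load has not returned valid data.
    return c.load_pending and not c.ldvalid


def latent_stall(c: Cycle) -> bool:
    # Assume a cache hit; stall only if stage 3 needs data that has not arrived.
    return c.load_pending and not c.ldvalid and c.stage3_needs_data


# Simplification: the trace is fixed, so a stall does not shift later cycles.
trace = [
    Cycle(load_pending=True, ldvalid=False, stage3_needs_data=False),
    Cycle(load_pending=True, ldvalid=False, stage3_needs_data=False),
    Cycle(load_pending=True, ldvalid=False, stage3_needs_data=True),
    Cycle(load_pending=True, ldvalid=True, stage3_needs_data=True),
]

eager_stalls = sum(eager_stall(c) for c in trace)
latent_stalls = sum(latent_stall(c) for c in trace)
print(eager_stalls, latent_stalls)
```

In this assumed trace the pipeline advances during the first two cycles under the latent policy,
because no stage 3 unit depends on the outstanding load.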
Appendix I provides detailed logic equations in HDL format detailing the method of
the present invention in the context of the aforementioned ARC™ extensible RISC processor
core. It will be recognized, however, that the logic equations of Appendix I (and those
specifically described in greater detail below) are exemplary, and merely illustrative of the
broader concepts of the invention.
While each of the improvement elements referenced above may be used in isolation, it
should be recognized that these improvements advantageously may be used in combination. In
particular, the combination of an instruction memory cache with the bypass logic will serve to
maximize instruction execution rates. Likewise, the use of a data cache minimizes data related
processor stalls. Combining the breakpoint function with memory caches mitigates the impact
of the breakpoint function. Selection of combinations of these functions trades
complexity against performance. It will be appreciated that the choice of functions may be
determined by a number of factors including the end application for which the processor is
designed.

"Atomic" Instructions

In one aspect, the invention prevents the host from halting the core while an
instruction with a long immediate value in stage 2 has not yet merged. This makes
instructions containing long immediate data non-stallable on the boundary between the
instruction opcode and the immediate data. Consequently, the instruction containing long
immediate data is treated as if the CPU were wider in word width for that instruction only.
The foregoing functionality is specifically accomplished within the ARC™ core by
connecting the hold_host value to the instruction merge logic, i.e., p2_merge_valid_r and
p2limm. Fig. 5 illustrates one exemplary embodiment of the logical flow of this
arrangement. The method 500 generally comprises first determining whether an instruction
with long immediate (limm) data is present (step 502); if so, the core merge logic is
examined to determine whether merging in stage 2 of the pipeline has occurred (step 504).
If merging has occurred (step 506), the halt signal to the core is enabled (i.e., "halt
permissive" per step 508), thereby allowing the core to be halted at any time upon initiation
by the host. If merging has not occurred per step 506, then the core waits one instruction
cycle (step 510) and then re-examines the merge logic to determine if merging has
occurred. Accordingly, long immediate instructions cannot be stalled unless merging has
occurred, which effectively precludes stalling on the instruction/immediate data word
boundary.
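
The flow of Fig. 5 can be sketched in a few lines of Python. This is an illustrative model
only, not the logic equations of Appendix I: the function names are hypothetical, and the
signal polarities and the way p2limm and p2_merge_valid_r are combined are assumptions
inferred from the description above.

```python
from typing import Optional, Sequence


def halt_permitted(p2limm: bool, p2_merge_valid_r: bool) -> bool:
    """Return True when the host may halt the core (the "halt permissive" state).

    Assumed meaning of the inputs:
      p2limm           -- the instruction in stage 2 carries long immediate data
      p2_merge_valid_r -- opcode and long immediate word have merged in stage 2
    """
    # No long immediate present (step 502), or merge already done (step 506).
    return (not p2limm) or p2_merge_valid_r


def first_halt_cycle(p2limm: bool,
                     merge_valid_per_cycle: Sequence[bool]) -> Optional[int]:
    """Re-examine the merge logic once per cycle (step 510) until a halt is allowed."""
    for cycle, merged in enumerate(merge_valid_per_cycle):
        if halt_permitted(p2limm, merged):
            return cycle
    return None  # merge never observed within the supplied window


# A long immediate instruction whose second word arrives two cycles late:
print(first_halt_cycle(True, [False, False, True]))
```

In this model a pending halt request is simply deferred: the host never observes the core
stopped between the opcode word and its immediate data word.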

Appendix I hereto provides detailed logic equations (rendered in hardware
description language) of one exemplary embodiment of the functionality of Fig. 5,
specifically adapted for the aforementioned ARC core manufactured by the Assignee
hereof.

Enhanced Parallelism

As previously shown in Fig. 4, the speed of each pipeline stage in the non-optimized 
prior art pipeline structure is bound by the slowest stage. Some functional blocks within the 
instruction fetch pipeline stage of the processor are not optimally placed within the pipeline 
structure. 

Fig. 6 illustrates the impact on pipeline operation of the methods of enhanced
parallelism according to the present invention. The dark shaded blocks 602, 604, 606, 608,
610 show areas of modification. These modifications, when implemented, produce
significant improvements to the maximum speed of the core. Specifically, full pipelining of
the blocks as in the present embodiment allows them to overlap with other blocks, and
hence their propagation delay is effectively hidden. It is noted that these modifications do
not change the instruction set architecture (ISA) in any way, but do produce slight changes
in the timing of 64-bit instructions, instructions in delay slots, and jump indirect
instructions which could need to bypass data words from slow execution units to generate
nextpc.



WO 01/69378 


Fig. 7 is a block diagram of the modified pipeline architecture 700 according to one
embodiment of the invention. In the modified architecture of Fig. 7, the slow cache path
does not influence the control path (unlike that of the prior art approach of Figs. 4 and 4a),
thereby reducing processor pipeline delays. Specifically, the ivalid signal 702 produced by
the data word selection and cache "hit" evaluation logic 704 is latched into the first stage
latch 706. Additionally, the long immediate instruction word (p1iw) signal 708 resulting
from the logic 704 is latched into the first stage latch 706.

Using the arrangement of Fig. 7, the dataword fetch (ifetch) signal 717, which
indicates the need to fetch instruction opcode or data from memory at the location being
clocked into the program counter (pc) at the end of the current cycle, is decoupled or made
independent of the ivalid signal 702. This results in the instruction cache 709 ignoring the
ifetch signal 717 (except when a cache invalidate is requested, or on start-up).

Additionally, due to the latching arrangement of Fig. 7, the next program counter
signal (nextpc) 716, which is indicative of the dataword address, is made independent of the
word supplied by the memory controller (p1iw) 708 and ivalid 702. Using this approach,
nextpc is only valid when ifetch 717 is true (i.e., required opcode or dataword needs to be
fetched by the memory controller) and ivalid is true (apart from start-up, or after an
ivalidate). Note that the critical path signal or unnecessarily slow signal is readily revealed
when the "nextpc" path 416 is removed (dotted flow lines of Fig. 4a).

The hazard detection logic 722 and pipeline control logic 724 are further made
independent of the ivalid signal 702; i.e., the stage 1, stage 2, and stage 3 enables (en1 727,
en2 729, and en3 730, respectively) are decoupled from the ivalid signal 702. Therefore,
influence on pipeline movement by ivalid 702 is advantageously avoided.

Instructions with long immediate data are merged in stage 2. This merge at stage 2
is a consequence of the foregoing independence of the hazard logic 722 and control logic
724 from ivalid 702; since these instructions with long immediate data are made up of
multiple multi-bit words (e.g., two 32-bit data words), two accesses of the instruction cache
709 are needed. That is, an instruction with a long immediate should not move to stage 3
until both the instruction and long immediate data are available in stage 2 of the pipeline.
This requirement is also imposed for jump instructions with long immediate data values. In
current practice, the instruction opcode comes from stage 2 and the long immediate data
from stage 1 when a long immediate instruction is issued, that is, when the instruction
moves to stage 3.

The present invention further utilizes "structural stalls" to enhance pipeline
performance such as when a slow functional unit (or operand fetch in the case of the xy
memory extension) generates nextpc 716 (that is, jump register indirect instructions, j [rx],
where the value of rx can be bypassed from a functional unit). As used herein, the term
"structural stalls" refers to stall requirements that are defined by limitations inherent in the
functional unit. One example of a structural stall is the operand fetch associated with the
XY memory extension of the ARC processor. This approach advantageously allows slow
forwarding paths to be removed, by prematurely stalling the impeding operation. For
example, new program counter (pc) values are rarely generated by multipliers; if such
values are generated by the multiplier, they can result in a one-cycle stall
or bubble, allowing nextpc to be obtained from the register file 731. In general, the
present invention exploits the stall that is inherent in generating a next PC address which is
not sequentially linear in the address space. This occurs when a new PC value is calculated
by an instruction such as jump. In addition, it may be appreciated that certain instruction
sets permit arithmetic and logic operations to directly generate a new PC. Such computations
also introduce a structural stall which under some circumstances may be exploited to continue
operation of the CPU.
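
The trade described above, removing a slow forwarding path at the cost of an occasional
bubble, can be sketched as a small decision function. This is a hedged illustration only: the
unit labels, the SLOW_UNITS set, and the function name are hypothetical and do not come from
the ARC core.

```python
from typing import Optional, Tuple

# Hypothetical labels for functional units whose results arrive too late to be
# forwarded into the nextpc path without lengthening the critical path.
SLOW_UNITS = {"multiplier", "xy_memory"}


def resolve_jump_register(producer: Optional[str]) -> Tuple[int, str]:
    """Decide how a jump register indirect instruction (j [rx]) obtains rx.

    producer -- the unit still computing rx, or None if rx is already in the
                register file.
    Returns (bubble_cycles, source_of_rx).
    """
    if producer is None:
        return 0, "register_file"   # value already written back
    if producer in SLOW_UNITS:
        # Structural stall: wait one cycle, then read rx from the register
        # file, so no slow forwarding path into nextpc is required.
        return 1, "register_file"
    return 0, "bypass"              # fast unit: forward as usual


print(resolve_jump_register("multiplier"))
```

Because new pc values are rarely produced by the slow units, the one-cycle bubble is paid
infrequently, while the shorter nextpc path benefits every cycle.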

In addition to the foregoing, the present invention further removes or optimizes
remaining critical paths within the processor using selective pipelining of operations.
Specifically, paths that can be extended over more than one processor cycle with no
processor performance loss can be selectively pipelined if desired. As an example, the
process of (i) activating sleep mode, (ii) stopping the core, and (iii) detecting a breakpoint
instruction, does not need to be performed in a single cycle, and accordingly is a candidate
for such pipelining.



Breakpoint Instruction Decode Architecture 

Referring now to Fig. 8, one embodiment of the modified breakpoint architecture of
the invention is described. As illustrated in Fig. 8, the architecture 800 comprises generally
a first stage latch (register) 801, an instruction cache 802, instruction request selection logic
804, an intermediate (e.g., second stage) latch 806, and instruction decode logic 808. The
instruction cache 802 stores or caches instructions received from the latch 801 which are to
be decoded by the instruction decode logic 808, thereby obviating at least some program
memory accesses. The design and operation of instruction (program) caches is well known
in the art, and accordingly will not be described further here. The instruction word(s) stored
within the instruction cache 802 is/are provided to the instruction request address selection
logic 804, which utilizes the program counter (nextpc) register to identify the next
instruction to be fetched, based on data 810 (e.g., 16-bit word) from the instruction decode
logic 808 and the current instruction word. This data includes such information as condition
codes and other instruction state information, assembled into a logical collection of
information which is not necessarily physically assembled. For example, a condition code
by itself may select an alternative instruction to be fetched. The address from which the
instruction is to be fetched may be identified by a variety of words such as the contents of a
register or a data word from memory. The instruction word provided to the instruction
request logic 804 is then passed to the intermediate latch 806, and read out of that latch on
the next successive clock cycle by the instruction decode logic 808.

Hence, in the case of a breakpoint instruction, the decode of the instruction (and its
subsequent execution) in the present embodiment is delayed until stage 2 of the pipeline.
This is in contrast to the prior art decode arrangement (Fig. 1), wherein the instruction
decode logic 808 is disposed immediately following the instruction cache 802, thereby
providing for immediate decode of a breakpoint instruction after it is moved out of the
instruction cache 802 (i.e., in the first stage), which places the decode operation in the
critical path.

Additionally, in order to move the breakpoint instruction decode to stage 2 as
described above, the program counter (pc) of the present embodiment is changed from the
current value to the breakpoint pc value through a simple assignment. This modification is
required based on timing considerations; specifically, by the time the breakpoint instruction is
decoded, the pc has already been updated to point to the next instruction. Hence, the pc value
must be "reset" back to the breakpoint instruction value to account for this decoding delay.
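
The pc "reset" can be illustrated with a toy two-stage model in Python. It is a sketch under
stated assumptions rather than the actual core logic: the program is a hypothetical dictionary
of word addresses to opcode names, instructions are one word long, and the names currentpc and
lastpc follow the usage in the examples below.

```python
from typing import Dict, List, Optional, Tuple


def run_until_break(program: Dict[int, str],
                    start_pc: int) -> Tuple[Optional[int], List[int]]:
    """Fetch sequentially until a "brk" opcode is decoded in stage 2.

    Returns (restart_pc, killed_fetch_addresses). restart_pc is None when the
    program runs off its end without a breakpoint being decoded.
    """
    currentpc = start_pc              # address being fetched in stage 1
    lastpc: Optional[int] = None      # address of the instruction now in stage 2
    stage2_op: Optional[str] = None
    killed: List[int] = []

    while True:
        if stage2_op == "brk":
            # Decode happens one stage late, so currentpc already points past
            # the breakpoint. Kill the stage 1 fetch and restore the pc.
            killed.append(currentpc)
            currentpc = lastpc
            return currentpc, killed
        if currentpc not in program:
            return None, killed
        # Advance the pipeline: the stage 1 instruction moves to stage 2.
        stage2_op = program[currentpc]
        lastpc = currentpc
        currentpc += 1


# Hypothetical program: add at A=0, breakpoint at B=1, move at C=2.
print(run_until_break({0: "add", 1: "brk", 2: "mov", 3: "mov2"}, 0))
```

When the host later replaces the breakpoint and restarts the core, fetching resumes from the
restored address, so the instruction written at the breakpoint location is the first one fetched.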
The following examples illustrate the operation of the modified breakpoint
instruction decode architecture of the present invention in detail.

Example 1 - Delay Slot

Fig. 8a and the discussion following hereafter illustrate how a breakpoint instruction
located within a delay slot is processed using the present invention. As is well known in
the digital processing arts, delay slots are used in conjunction with certain instruction types
for including an instruction which is executed during execution of the parent instruction.
For example, a "jump delay slot" is often used to refer to the slot within a pipeline
subsequent to a branching or jump instruction being decoded. The instruction after the
branch (or load) is executed while awaiting completion of the branch/load instruction. It
will be recognized that while the example of Fig. 8a is cast in terms of a breakpoint
instruction disposed in the delay slot after a "Jump To" instruction, other applications of
delay slots may be used, whether alone or in conjunction with other instruction types,
consistent with the present invention.

Note that as used herein, the nomenclature "<name><Address>" refers to the
instruction name at a given address. For example, "J.dA" refers to a "Jump To" instruction
at address A.

In step 820 of Fig. 8a, an instruction (e.g., "Jump To" at address A, or "J.dA") is
requested. Next, the breakpoint instruction at address B (BrkB) is requested in step 822. In
step 824, the target address at address C (TargetC) is requested. The target address is saved
in the second operand register or the long immediate register of the processor in the
illustrated example. The instruction in the fetch stage is killed.

Next, in step 826, the breakpoint instruction of step 822 above (BrkB) is decoded.
The current pc value is updated with the value of lastpc, the address of BrkB rather than the
address of TargetC, as previously described. An extra state is also implemented in the
present embodiment to indicate (i) that a 'breakpoint restart' is needed, and (ii) if the
breakpoint instruction was disposed in a delay slot (which in the present example it is).

In step 828, the "Jump To" instruction J.dA completes, and once all other multi-cycle
instructions have completed, the core is halted, reporting a break instruction. Next, in
step 830, the host takes control and changes BrkB to AddB (for example, by a "write" to
main memory). The host then invalidates the memory mapping of address B by either
invalidating the entire cache or invalidating the associated cache line. The host then starts
the core running.

After the core is running, the add instruction at address B, AddB, is fetched using the
current program counter value (currentpc) in step 832. Then, in step 834, the target value at
address C (TargetC) is requested, using the target address from stage 3 of the pipeline. The
current program counter value (currentpc) is set equal to the TargetC address. In step 836,
Target2C is requested. Lastly, in step 838, the Target3C is requested.

Note that in the example of Fig. 8a above, the breakpoint instruction execution is
complicated by the presence of a delay slot. This requires the processor to restart operation
at the delay slot after the completion of the breakpoint instruction. The instruction at the
delay slot address is then executed, followed by the instruction at the address specified by
the jump instruction. The program continues from the target address.

Example 2 - Non-delay Slot Breakpoint Use

Fig. 8b and subsequent discussion illustrate how a breakpoint instruction is
normally handled within the pipeline when a delay slot is not present.

First, in step 840, an add at address A (AddA) is requested. A breakpoint instruction
at address B (BrkB) is then requested in step 842. A "move" at address C (MovC) is next
requested in step 844. The instruction in the fetch stage (stage 1) is killed. The breakpoint
instruction (BrkB) is next decoded in step 846. The current pc value is updated with the
value of lastpc, i.e., the address of BrkB rather than the address of the instruction following
MovC. MovC is killed.

Next, in step 848, the AddA instruction completes, and once all other multi-cycle
instructions (including delayed loads) have completed, the processor is halted, reporting a
break instruction. The host then takes control in step 850, changing BrkB to AddB (such as
by a write to main memory). The host then invalidates the memory mapping of address B
by either invalidating the entire cache or invalidating the associated cache line. The host
then starts the core running again per step 850.

In step 852, the add instruction at address B (AddB) is fetched using the current
address in the program counter (currentpc). A move at address C (MovC) is again requested
in step 854. Mov2C is then requested in step 856, and lastly Mov3C is requested in step 858.


Example 3 - Stalled Jump and Branches 

Referring now to Fig. 8c, in step 860, the jump instruction J.dA is requested. The
breakpoint instruction (BrkB) is next requested in step 862. TargetC is next requested in step
864. The target address is saved in the second operand register or the long immediate
register in the illustrated embodiment, although it will be recognized that other storage
locations may be utilized.

The breakpoint instruction (BrkB) is next decoded in step 866. Current pc is
updated with the value of lastpc, the address of BrkB rather than the address of TargetC. As
with the example of Fig. 8b above, an extra state is added to indicate (i) that a 'breakpoint
restart' is needed, and (ii) if the breakpoint instruction was in a delay slot. The "Jump To"
instruction J.dA is stalled in stage 3 since, inter alia, it may be a link jump. Once all other
multi-cycle instructions have completed, the core is halted, and a break instruction reported.
In step 868, the host takes control and changes BrkB to TargetC. The host then invalidates
the memory mapping of address B by either invalidating the entire cache or invalidating
the associated cache line. The host then starts the core running in step 870.

The add instruction at address B (AddB) is next fetched using the address of the
currentpc. In step 874, TargetC is requested, using the target address from stage 3 (execute)
of the pipeline. The currentpc address is set equal to the TargetC address. Target2C is then
requested per step 876, and Target3C is requested per step 878.

Note that in the example of Fig. 8c, the breakpoint instruction is disposed in a delay
slot, but the processor pipeline is stalled. The breakpoint instruction is held for execution
until the multi-cycle instructions have completed executing. This limitation is imposed to
prevent leaving the core in a state of partial completion of a multi-cycle instruction during
the breakpoint instruction execution.

Bypass Logic 

Referring now to Fig. 9, the bypass logic 900 of the present invention comprises
bypass operand selection logic 902, one or more single cycle functional units 904, one or
more multi-cycle functional units 906, result selection logic 908 operatively coupled to the
output of the single cycle functional units, a register 910 coupled to the output of the result
selection logic 908 and the multi-cycle functional units 906, and more multi-cycle
functional units 912 and result selection logic 914 coupled sequentially to the output of the
register 910 as part of the second execute stage 920. A second register 918 is also coupled
to the output of the result selection logic 914. A return path 922 connects the output of the
second stage result selection logic 914 to the input of a third "multi-function" register 924,
the latter providing input to aforementioned bypass operand selection logic 902. A similar
return path 926 is provided from the output of the first stage result selection logic 908 to the
input of the third register 924. As used herein, the term "single-cycle" refers to instructions
which have only one execute stage, while the term "multi-cycle" refers to instructions
having two or more execute stages. Of particular interest are the instructions that are
multi-cycle by virtue of a need to load long immediate data. These instructions are formed, e.g.,
by two sequential instruction words in the instruction memory. The first of the words
generally includes the op-code for the instruction, and potentially part of the long
immediate data. The second word is made up of all (or the remainder) of the long
immediate data.

By employing the bypass arrangement of Fig. 9, the present invention replaces the
register or memory location used in prior art systems such as that illustrated in Fig. 2 with a
special register 924 that serves multiple purposes. When used in a "bypass" mode, the
special register 924 retains the summation result and presents the summation result as a
value to be selected from by an instruction. The result is a software loop that can execute
nearly as fast as custom-built hardware. The execution pipeline fills with the instructions to
perform the sum of products operation and the bypass logic permits the functional units to
operate at peak speed without any additional addressing of memory. Other functions of this
register 924 (in addition to the aforementioned "bypass" mode operation) include (i)
latching the source operands to permit fully static operation, and (ii) providing a centralized
location for synchronization signal/data movement.

As can be seen from Fig. 9, the duration for single cycle instructions in the present
embodiment of the pipeline is unchanged as compared to that for the prior art arrangement
(Fig. 2); however, multi-cycle instructions benefit from the pipeline arrangement of the
present invention by effectively removing the bypass logic during the last cycle of the
multi-cycle execution. Note that in the case of single cycle instructions, the bypass logic is
not on the critical path because the datapath is sequenced to permit delay-free operation. By
moving the latches (register) 924 to the front of the datapath as in Fig. 9, the second and
subsequent cycles required for instruction execution are provided with additional time. This
additional time comes from the fact that there are no additional decoding delays associated
with the logic for the functional units and operand selection, and because the register 924
may be clocked by a later pipeline stage. Since a later stage clock signal may be used to
clock the register, the register latching is accomplished prior to the clock signal associated
with the operand decode logic. Hence, the operand decode logic is not "left waiting" for the
latching of the register 924.

In one exemplary design of the ARC™ core incorporating the bypass logic
functionality of the invention as described above with respect to Fig. 9, the decode logic
900 and functional units 904, 906 are constrained to be minimized simultaneously. This
constraint during design synthesis advantageously produces one fewer level of gate delay in
the datapath as compared to the design resulting if such constraint is not imposed, thereby
further enhancing pipeline performance. It will be appreciated that this refinement is not
necessary to practice the essence of the invention, but serves to further the performance
enhancement of the invention.

The results of the previous operation (specifically, in the foregoing sum-of-products
example, the sum from a given iteration) are provided to the multi-function register 924
which in turn provides the sum value directly to the input of the bypass operand selection
logic 902. In this fashion, the bypass operand selection logic 902 is not required to access a
memory location or another register repeatedly to provide the operands for the summation
operation.
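
The effect of holding the running sum in the multi-function register can be sketched as
follows. This Python model is illustrative only: the access-counting scheme (one read and one
write of a general purpose register per iteration without the bypass register, and a single
final write-back with it) is an assumption chosen to make the contrast visible, not a
measurement of any real core.

```python
from typing import Sequence, Tuple


def sum_of_products(a: Sequence[int], b: Sequence[int],
                    use_bypass_register: bool) -> Tuple[int, int]:
    """Return (sum of a[i] * b[i], number of general purpose register accesses)."""
    regfile_accesses = 0
    acc = 0  # running sum; lives in the multi-function register when bypassing

    for x, y in zip(a, b):
        product = x * y
        if not use_bypass_register:
            # Assumed cost: read the running sum, then write it back.
            regfile_accesses += 2
        acc += product  # with bypassing, the sum is fed straight back as an operand

    if use_bypass_register:
        regfile_accesses += 1  # single write-back once the loop completes

    return acc, regfile_accesses


print(sum_of_products([1, 2, 3], [4, 5, 6], use_bypass_register=True))
print(sum_of_products([1, 2, 3], [4, 5, 6], use_bypass_register=False))
```

Under these assumptions the number of register accesses without bypassing grows with the
iteration count, while the bypassed loop touches the general purpose register only once.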

It is also noted that the present invention may advantageously be implemented
"incrementally" by moving lesser amounts of the bypass logic to the execution stage (e.g.,
stage 3). For example, rather than moving all bypass logic to stage 3 as described above,
only the logic associated with bypassing of late arriving results of functional units can be
moved to stage 3. It will be appreciated that differing amounts of logic optimization will be
obtained based on the amount of bypass logic moved to stage 3.
In addition to the structural improvement in performance as previously described
(i.e., obviating memory/register accesses during each iteration of multi-cycle instructions,
thereby substantially reducing the total number of memory/register accesses performed
during any given iterative calculation), there are several additional benefits provided by
employing the bypass logic arrangement of the present invention. One such benefit is that
by removing the interposed register between the bypass operand selection and the
functional units (shown in Fig. 2), design compilers can better optimize the generated logic
to maximize speed and/or minimize the number of gates in the design. Specifically, the
design compiler does not have to consider and account for the presence of the register
interposed between the bypass operand selection logic and the single/multi-cycle functional
units.

Another benefit is that by grouping the registers and logic in the improved fashion
of Fig. 9, the bypass function is better isolated from the rest of the design. This makes
VHDL simulations potentially execute faster and simplifies fault analysis and coverage.

In sum, two primary benefits are derived from the improved bypass logic design
described above. The first benefit is the ability to manage late arriving results from the
functional units more efficiently. The second benefit is that there is better logic
optimization within the device.

The first benefit may be obtained by only moving the minimum required portion of
the logic to the improved location. The second benefit may be attained in varying degrees
depending on the amount of logic that is moved to the new location. This second benefit
derives at least in part from the synthesis engine's improved ability to optimize the results.
The ability to optimize the results stems from the way in which the exemplary synthesis engine
functions. Specifically, synthesis engines generally treat all logic between registers as a
single block to be optimized. Blocks that are divided by registers are optimized only to the
registers. By moving the operand selection logic so that no registers are interposed between
it and the functional unit logic, the synthesis engine can perform a greater degree of
optimization.

More detail on the design synthesis process incorporating the bypass logic of the
present invention is provided herein with respect to Fig. 13.

Referring now to Fig. 10, a method for operating the pipeline of a pipelined
processor which facilitates the bypass of various components and registers so as to
maximize processor performance during iterative operations (e.g., sum of products) is
disclosed. The first step 1002 of the method 1000 comprises providing a multi-function
register 914 such as that described with respect to Fig. 9 above. This register is defined in
step 1004 to include a "bypass mode", wherein during such bypass mode the register
maintains the result of a multi-cycle scalar operation therein. In this fashion, the bypass
operand selection logic 902 is not required to access memory or another location to obtain
the operand (e.g., Sum value) used in the iterative calculation as in prior art architectures.
Rather, the operand is stored by the register 914 for at least a part of one cycle, and
provided directly to the bypass operand selection logic using decode information from the
instruction to select register 914 directly without the need for any address generation. This
type of register access differs from the general purpose register access present in RISC
CPUs in that no address generation is required. General purpose register access requires
register specification and/or address generation which consumes a portion of an instruction
cycle and requires the use of the address generation resource of the CPU. The register
employed in the bypass logic is an "implied" register that is specified by the instruction
being executed without the need for a separate register specification. For certain
instructions the registers of the datapath may function the same as an accumulator or other
register. The value stored in the datapath register is transferred to a general purpose register
during a later phase of the pipeline operation. In the meantime, iteration or other operations
continue to be processed at full speed.

Next, in step 1006, a multi-cycle scalar operation is performed by the processor a 
first time. In the foregoing example of the sum-of-products calculation, such an operation
comprises one iteration of the "Multiply" and "Sum" sub-operations, the result of the Sum
sub-operation being provided back to the multi-function register 914 per step 1008 for
direct use in the next iteration of the calculation. 

In step 1010, the result of the previous iteration is provided directly from the 
register 914 to the bypass operand selection logic 902 via a bus element

Lastly, a second iteration of the operation is performed using the result of the first
operation from the register 914, and another operand supplied by the address generation
logic of the RISC CPU. The iterations are continued until the multi-cycle operation is
completed (step 1011), and the program flow stopped or otherwise continued (step 1012).
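
The iterative method above can be illustrated with a small software model. The following is only a sketch under stated assumptions: the class `Datapath`, its access counter, and both function names are hypothetical and are not taken from the disclosure. The attribute `bypass_reg` stands in for the multi-function register 914, and the model simply counts general purpose register accesses to show what the implied register avoids.

```python
class Datapath:
    """Toy model: a general purpose register file plus one implied result register."""

    def __init__(self):
        self.regfile = {}          # general purpose registers, addressed by name
        self.regfile_accesses = 0  # counts every addressed read or write
        self.bypass_reg = 0        # stands in for the multi-function register 914

    def read_reg(self, name):
        self.regfile_accesses += 1
        return self.regfile[name]

    def write_reg(self, name, value):
        self.regfile_accesses += 1
        self.regfile[name] = value


def sum_of_products_bypass(dp, xs, ys, dest="r0"):
    """Keep the running Sum in the implied register; write back once at the end."""
    dp.bypass_reg = 0
    for x, y in zip(xs, ys):
        product = x * y                          # "Multiply" sub-operation
        dp.bypass_reg = dp.bypass_reg + product  # "Sum" result fed straight back
    dp.write_reg(dest, dp.bypass_reg)            # later transfer to a general purpose register
    return dp.regfile[dest]


def sum_of_products_regfile(dp, xs, ys, dest="r0"):
    """Comparison case: the running Sum is read and written by address each iteration."""
    dp.write_reg(dest, 0)
    for x, y in zip(xs, ys):
        dp.write_reg(dest, dp.read_reg(dest) + x * y)
    return dp.regfile[dest]
```

Both functions compute the same value; in this toy model the bypass version touches the addressed register file once, whereas the comparison version touches it twice per iteration plus once for initialization.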



Data Cache Integration 

Integration of the data cache can have a profound effect on the speed of the 
processor. In general, the modified control of the data cache according to the present 
invention is accomplished through data hazard control logic modifications. The following 
discussion describes several enhancements to the prior art data cache integration scheme of
Fig. 3 made by the present invention, including (i) assumption of data cache "hit" unless a
"miss" actually occurs; (ii) improved instruction request address generation; and (iii)
relocation of bypass logic from stage 2 (decode) to stage 3 (execute). It should also be
noted that some of these modifications provide other benefits in the operation of the core in
addition to improved pipeline performance, such as lower operating power, reduced
memory accesses, and improved memory performance. 

Referring now to Figs. 11 and 11a, the improved data cache structure and method of
the present invention is described in further detail. 

One embodiment of the improved data cache architecture is shown in Fig. 11, in the
context of the multi-stage pipeline of the aforementioned ARC™ RISC processor. The
architecture 1100 comprises a data cache 1102, bypass operand selection logic 1104
(decode stage), result selection logic 1106 (2 logic levels), latch control logic 1108 (2
levels), program counter (nextpc) address selection logic 1110 (2 levels), and cache address
selection logic 1112 (2 levels), each of the logic units 1106, 1108, 1112 operatively
supplying a third stage latch (register) 1116 disposed at the end of the second execution
stage (E2) 1118. Summation logic 1111 is also provided which sums the outputs of the
bypass operand selection logic 1104 prior to input to the multiplexers 1120, 1122 in the
data cache 1102.

In addition to the multiplexers 1120, 1122, the data cache 1102 comprises a
plurality of data random access memory (RAM) devices 1126 (0 through w-1), further
having two sets of associated tag RAMs 1127 (0 through w-1) as shown. As used herein,
the variable "w" represents the number of ways that a set associative cache may be
searched. In general, w corresponds to the width of the memory array in multiples of a
word. For example, the memory may be two words wide (w=2) and the memory is then
divided into two banks for access. The output of the data RAMs 1126 is multiplexed using
a (w-1) channel multiplexer 1131 to the input of the byte/word/long word extraction logic
1132, the output of which is the load value 1134 provided to the result selection logic 1106.
The output of each of the tag RAMs 1127 is logically ORed with the output of the
summation logic 1111 in each of the 0 through w-1 memory units 1138. The outputs of the
memory units 1138 are input in parallel to a logical "OR" function 1139 which determines
the value of the load valid (ldvalid) signal 1140, the latter being input to the latch control
logic 1108 prior to the third stage latch 1116.
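
The way-parallel lookup described above can be sketched in software. This is a minimal illustration, not the disclosed circuit: it assumes the conventional arrangement in which each way's stored tag is compared against the tag bits of the requested address, and the function name `cache_lookup` and its arguments are hypothetical. The per-way results are combined with a logical OR to form a value playing the role of ldvalid, and the hitting way selects the load value.

```python
def cache_lookup(tag_rams, data_rams, index, tag):
    """Return (ldvalid, load_value) for a w-way set associative lookup.

    tag_rams and data_rams are lists of w per-way arrays (assumed layout).
    """
    # One tag check per way, conceptually performed in parallel.
    way_hits = [tags[index] == tag for tags in tag_rams]
    # Logical OR across the ways yields the load-valid indication.
    ldvalid = any(way_hits)
    load_value = None
    if ldvalid:
        way = way_hits.index(True)          # the multiplexer selects the hitting way
        load_value = data_rams[way][index]
    return ldvalid, load_value
```

With two ways (w=2), a request whose tag matches the second way's entry returns that way's data; a request matching neither way returns a false ldvalid and no data.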

In comparison to the prior art arrangement of Fig. 3 previously described, the 
present embodiment has relocated the bypass operand selection logic from the decode stage
(and E2 stage) of the pipeline to the first execute stage (E1) as shown in Fig. 11.
Additionally, the nextpc address selection logic 1110 receives the load value immediately
after the data cache multiplexer 1131, as opposed to receiving the load value after the
results selection logic as in Fig. 3. The valid signal for returning loads (ldvalid) 1140 is also
routed directly to the two-level latch control logic 1108, versus to the pipeline control and
hazard detection logic as in Fig. 3. 

The foregoing modifications provide the following functionality:

(i) Assumption of data cache "hit" - In contrast to the prior art approach of Figs. 3
and 3a, the ldvalid signal 1140 does not influence pipeline movement in the present
invention, since it is decoupled from the control logic and hazard detection logic. Rather, it
is assumed that the data cache will "hit", and therefore the pipeline will continue to move.
If this assumption is wrong, and the requested dataword is needed by an execution unit in
the execution stage (E1 or E2), the pipeline is stalled at that point. When the data cache
1102 makes the dataword available to the execution unit in need thereof, the operand for
the instruction in the decode stage is updated.

(ii) Instruction Request Address Generation - Word or byte extracted load results do
not usually generate the instruction request address for a jump register indirect instruction
(e.g., j [rx]). Therefore, as part of the present invention, the instruction request address is
generated earlier by the next address selection logic of Fig. 11, and a jump register
indirect address where the register value is bypassed from a load byte or word causes a
structural pipeline stall.

(iii) Relocation of Bypass Logic - As illustrated in Fig. 11, the present invention
also relocates the bypass operand selection logic from stage 2 (decode) to stage 3 (execute
E1), and from execute E2 to E1, to allow the multi-cycle/multi-stage functional units cache
extra time on all cycles but the first.

Fig. 11a graphically illustrates the movement of the pipeline of an exemplary processor
configured with the data cache integration improvements of the present invention. Note that
the un-dashed bypass arrow 1170 indicates prior art bypass logic operation, while the dashed
bypass arrow 1172 indicates bypass logic if it is moved from stage 2 to 3 according to the
present invention. The following provides an explanation of the operation of the data cache
of Fig. 11a.

In step 1174, a load (Ld) is requested. Next, a Mov is requested per step 1176. An 
Add is then requested per step 1178. In step 1180, the Ld begins to execute. In step 1182, the
Mov begins to execute, and the cache misses. The Mov operation moves through the pipeline
per step 1184. The Add operation stalls in execute stage E1, since the cache missed and the
Add is dependent on the cache result. The cache then returns the Load Result Value per step
1186, and the Add is computed per step 1188. The Add moves through the pipeline per step
1190, and the Add result is written back per step 1192.

As illustrated in Fig. 11a, the improved method of data cache integration of the present
invention reduces the number of stalls encountered, as well as the impact of a cache "miss"
(i.e., condition where the instruction is not cached in time) during the execution of the
program. The present invention results in the add instruction continuing to move through the
pipeline until reference T, saving instruction cycles. Further, by delaying pipeline stalls, the
overall performance of the processor is increased.
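
The benefit of assuming a cache "hit" can be illustrated with a toy in-order issue model. This is a sketch under simplifying assumptions and does not reproduce the timing of Fig. 11a: the `Instr` record, the `miss_penalty` field, and both function names are hypothetical. The first function stalls the whole pipeline as soon as a load misses; the second lets instructions keep moving and stalls only an instruction that actually consumes the missing value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: tuple = ()
    miss_penalty: int = 0  # extra cycles before dest is usable (a load that misses)


def cycles_stall_on_miss(program):
    """Prior-art style: every instruction behind a missing load waits for it."""
    cycle = 0
    for ins in program:
        cycle += 1 + ins.miss_penalty
    return cycle


def cycles_assume_hit(program):
    """Assume-hit style: only a consumer of the missing value waits."""
    ready = {}  # register name -> first cycle at which its value can be consumed
    cycle = 0
    for ins in program:
        issue = cycle + 1
        for src in ins.srcs:
            issue = max(issue, ready.get(src, 0))
        ready[ins.dest] = issue + 1 + ins.miss_penalty
        cycle = issue
    return cycle
```

For a Ld that misses, followed by an independent Mov and a dependent Add, the assume-hit model lets the Mov proceed during the miss and holds only the Add, so the total cycle count is lower than in the stall-on-miss model.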



Method of Enhancing Performance of Processor Design

Referring now to Fig. 12, a method of enhancing the performance of a digital 
processor design such as the extensible ARC™ of the Assignee hereof is described. As
illustrated in Fig. 12, the method generally comprises first providing a processor design
which is non-optimized (step 1202), including inter alia critical path signals which
unnecessarily delay the operation of the pipeline of the design. For example, the
non-optimized prior art pipeline(s) of Figs. 1 through 4a comprises such designs, although
others may clearly be substituted. In the present embodiment of the method, the processor
design further includes an instruction set having at least one breakpoint instruction, for
reasons discussed in greater detail below. 

Next, in step 1204, a program comprising a sequence of at least a portion of the 
processor's instruction set (including for example the aforementioned breakpoint 
instruction) is generated. The breakpoint instruction may be coded within a delay slot as
previously described with respect to Fig. 8a herein, or otherwise. 

Next, in step 1206, a critical path signal within the processing of the program within the
pipeline is identified. In the illustrated embodiment, the critical path is associated with the
decode and processing of the breakpoint instruction. The critical path is identified through
use of a simulation run with a simulation program such as the "Viewsim™" program
manufactured by Viewlogic Corporation, or other similar software. Fig. 4a illustrates the
presence of a critical path signal in the dataword address (e.g., nextpc) generation logic of a
typical processor pipeline. 

Next, in step 1208 the architecture of the pipeline logic is modified to remove or
mitigate the delay effects of the non-optimized pipeline logic architecture. In the illustrated
embodiment, this modification comprises (i) relocating the instruction decode logic to the
second (decode) stage of the pipeline as previously described with reference to Fig. 8, and
(ii) including logic which resets the program counter (pc) to the breakpoint address, as
previously described. 

The simulation is next re-run (step 1210) with the modified pipeline configuration
to verify the operability of the modified pipeline, and also determine the impact (if any) on
pipeline operation speed. The design is then re-synthesized (step 1212) based on the
foregoing pipeline modifications. The foregoing steps (i.e., steps 1206, 1208, 1210, and
1212, or subsets thereof) are optionally re-performed by the designer (step 1214) to further
refine and improve the speed of the pipeline, or to optimize for other core parameters.
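
Step 1206 relies on a simulation tool to identify the critical path. As a rough illustration of the underlying idea only, and not of how any particular simulator works, the critical path through a block of combinational logic can be modeled as the longest-delay path through a directed acyclic graph. In the sketch below the function name `critical_path`, the node names, and the delay values are all hypothetical.

```python
from graphlib import TopologicalSorter


def critical_path(delays, fanin):
    """Return (total_delay, path) for the slowest path through a logic DAG.

    delays: node -> propagation delay of that node
    fanin:  node -> list of nodes driving it (every driver must appear in delays)
    """
    order = TopologicalSorter({n: fanin.get(n, []) for n in delays}).static_order()
    arrival, prev = {}, {}
    for node in order:
        preds = fanin.get(node, [])
        # The latest-arriving input determines when this node's output settles.
        best = max(preds, key=lambda p: arrival[p], default=None)
        arrival[node] = delays[node] + (arrival[best] if best is not None else 0)
        prev[node] = best
    end = max(arrival, key=arrival.get)
    path = [end]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return arrival[end], path[::-1]
```

Re-running such an analysis after each modification mirrors the identify, modify, and re-simulate cycle of steps 1206 through 1214: if the logic on the reported path is moved or shortened, the next run reports whichever path has become the slowest.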



Method of Synthesizing 

Referring now to Fig. 13, the method 1300 of synthesizing logic incorporating the
long instruction word functionality previously discussed is described. The generalized
method of synthesizing integrated circuit logic having a user-customized (i.e., "soft")
instruction set is disclosed in Applicant's co-pending U.S. Patent Application Serial No.
09/418,663 entitled "Method And Apparatus For Managing The Configuration And
Functionality Of A Semiconductor Design" filed October 14, 1999, which is incorporated
herein by reference in its entirety. 

While the following description is presented in terms of an algorithm or computer
program running on a microcomputer or other similar processing device, it can be
appreciated that other hardware environments (including minicomputers, workstations,
networked computers, "supercomputers", and mainframes) may be used to practice the
method. Additionally, one or more portions of the computer program may be embodied in
hardware or firmware as opposed to software if desired, such alternate embodiments being 
well within the skill of the computer artisan. 

Initially, user input is obtained regarding the design configuration in the first step 
1302. Specifically, desired modules or functions for the design are selected by the user, and 
instructions relating to the design are added, subtracted, or generated as necessary. For
example, in signal processing applications, it is often advantageous for CPUs to include a
single "multiply and accumulate" (MAC) instruction. In the present invention, the instruction
set of the synthesized design is further modified so as to incorporate the desired aspects of
pipeline performance enhancement (e.g. "atomic" instruction word) therein.

The technology library location for each VHDL file is also defined by the user in step
1302. The technology library files in the present invention store all of the information related
to cells necessary for the synthesis process, including for example logical function,
input/output timing, and any associated constraints. In the present invention, each user can
define his/her own library name and location(s), thereby adding further flexibility.

Next, in step 1303, the user creates customized HDL functional blocks based on the
user's input and the existing library of functions specified in step 1302.

In step 1304, the design hierarchy is determined based on user input and the
aforementioned library files. A hierarchy file, new library file, and makefile are subsequently
generated based on the design hierarchy. The term "makefile" as used herein refers to the
commonly used UNIX makefile function or similar function of a computer system well known
to those of skill in the computer programming arts. The makefile function causes other
programs or algorithms resident in the computer system to be executed in the specified order.
In addition, it further specifies the names or locations of data files and other information
necessary to the successful operation of the specified programs. It is noted, however, that the
invention disclosed herein may utilize file structures other than the "makefile" type to produce
the desired functionality.

In one embodiment of the makefile generation process of the present invention, the 
user is interactively asked via display prompts to input information relating to the desired 
design such as the type of "build" (e.g., overall device or system configuration), width of 
the external memory system data bus, different types of extensions, cache type/size, etc.
Many other configurations and sources of input information may be used, however, 
consistent with the invention. 
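
As an illustration of the makefile generation step, the sketch below turns a set of user-supplied build options into makefile text. It is only a hedged example: the function name `render_makefile`, the option keys, the variable names, and the `structural_hdl` target are hypothetical and do not reflect the file format actually produced by the disclosed program.

```python
def render_makefile(config):
    """Build makefile text from a dict of (hypothetical) build options."""
    sources = " ".join(config["hdl_files"])
    lines = [
        f"BUILD = {config['build']}",
        f"BUS_WIDTH = {config['bus_width']}",
        f"CACHE_SIZE = {config['cache_size']}",
        f"EXTENSIONS = {' '.join(config['extensions'])}",
        f"SOURCES = {sources}",
        "",
        "all: structural_hdl",
        "",
        # The target depends on the HDL sources; make recipe lines must start with a tab.
        "structural_hdl: $(SOURCES)",
        "\t@echo building $(BUILD) with $(EXTENSIONS)",
    ]
    return "\n".join(lines) + "\n"
```

In an interactive flow, the dictionary would be filled in from the user's answers to the display prompts (type of build, bus width, extensions, cache type and size) before the text is written to disk.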

In step 1306, the user runs the makefile generated in step 1304 to create the 
structural HDL. This structural HDL ties the discrete functional blocks in the design
together so as to make a complete design. 

Next, in step 1308, the script generated in step 1306 is run to create a makefile for 
the simulator. The user also runs the script to generate a synthesis script in step 1308.

At this point in the program, a decision is made whether to synthesize or simulate 
the design (step 1310). If simulation is chosen, the user runs the simulation using the
generated design and simulation makefile (and user program) in step 1312. Alternatively, if
synthesis is chosen, the user runs the synthesis using the synthesis script(s) and generated
design in step 1314. After completion of the synthesis/simulation scripts, the adequacy of
the design is evaluated in step 1316. For example, a synthesis engine may create a specific
physical layout of the design that meets the performance criteria of the overall design
process yet does not meet the die size requirements. In this case, the designer will make
changes to the control files, libraries, or other elements that can affect the die size. The
resulting set of design information is then used to re-run the synthesis script. 

If the generated design is acceptable, the design process is completed. If the design 
is not acceptable, the process steps beginning with step 1302 are re-performed until an
acceptable design is achieved. In this fashion, the method 1300 is iterative. 
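
The iterative character of method 1300 can be summarized as a simple loop. The following is a schematic sketch only: the function `run_design_flow` and its three callable parameters are hypothetical placeholders for the user input, generation, and evaluation steps, and the iteration limit is an added assumption rather than part of the described method.

```python
def run_design_flow(configure, build, is_acceptable, max_iterations=10):
    """Repeat configure/build/evaluate until the design is acceptable."""
    for attempt in range(1, max_iterations + 1):
        config = configure(attempt)   # step 1302: gather or adjust user input
        design = build(config)        # steps 1304-1314: generate, then simulate or synthesize
        if is_acceptable(design):     # step 1316: evaluate adequacy of the design
            return design, attempt
    raise RuntimeError("no acceptable design within the iteration limit")
```

For example, a designer whose first result exceeds a die size budget might shrink the cache on each pass until the evaluation succeeds.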

Fig. 14 illustrates an exemplary pipelined processor fabricated using a 1.0 um
process. As shown in Fig. 14, the processor 1400 is an ARC™ microprocessor-like CPU
device having, inter alia, a processor core 1402, on-chip memory 1404, and an external
interface 1406. The device is fabricated using the customized VHDL design obtained using
the method 1300 of the present invention, which is subsequently synthesized into a logic
level representation, and then reduced to a physical device using compilation, layout and
fabrication techniques well known in the semiconductor arts. For example, the present
invention is compatible with 0.35, 0.18, and 0.1 micron processes, and ultimately may be
applied to processes of even smaller or other resolution. An exemplary process for
fabrication of the device is the 0.1 micron "Blue Logic" Cu-11 process offered by
International Business Machines Corporation, although others may be used.

It will be appreciated by one skilled in the art that the processor of Figure 14 may 
contain any commonly available peripheral such as serial communications devices, parallel 
ports, timers, counters, high current drivers, analog to digital (A/D) converters, digital to
analog converters (D/A), interrupt processors, LCD drivers, memories and other similar
devices. Further, the processor may also include custom or application specific circuitry,
including an RF transceiver and modulator (e.g., Bluetooth™ compliant 2.4 GHz
transceiver/modulator), such as to form a system on a chip (SoC) device useful for
providing a number of different functionalities in a single package. The present invention is
not limited to the type, number or complexity of peripherals and other circuitry that may be
combined using the method and apparatus. Rather, any limitations are imposed by the
physical capacity of the extant semiconductor processes which improve over time.
Therefore it is anticipated that the complexity and degree of integration possible employing
the present invention will further increase as semiconductor processes improve. 

It is also noted that many IC designs currently use a microprocessor core and a DSP
core. The DSP however, might only be required for a limited number of DSP functions, or
for the IC's fast DMA architecture. The invention disclosed herein can support many DSP
instruction functions, and its fast local RAM system gives immediate access to data.
Appreciable cost savings may be realized by using the methods disclosed herein for both
the CPU & DSP functions of the IC.

Additionally, it will be noted that the methodology (and associated computer 
program) as previously described herein can readily be adapted to newer manufacturing
technologies, such as 0.18 or 0.1 micron processes (e.g. "Blue Logic™" Cu-11 process
offered by IBM Corporation), with a comparatively simple re-synthesis instead of the 
lengthy and expensive process typically required to adapt such technologies using "hard" 
macro prior art systems. 



Referring now to Fig. 15, one embodiment of a computing device capable of
synthesizing logic structures capable of implementing the pipeline performance
enhancement methods discussed previously herein is described. The computing device
1500 comprises a motherboard 1501 having a central processing unit (CPU) 1502, random
access memory (RAM) 1504, and memory controller 1505. A storage device 1506 (such as
a hard disk drive or CD-ROM), input device 1507 (such as a keyboard or mouse), and
display device 1508 (such as a CRT, plasma, or TFT display), as well as buses necessary to
support the operation of the host and peripheral components, are also provided. The
aforementioned VHDL descriptions and synthesis engine are stored in the form of an object
code representation of a computer program in the RAM 1504 and/or storage device 1506
for use by the CPU 1502 during design synthesis, the latter being well known in the
computing arts. The user (not shown) synthesizes logic designs by inputting design
configuration specifications into the synthesis program via the program displays and the
input device 1507 during system operation. Synthesized designs generated by the program
are stored in the storage device 1506 for later retrieval, displayed on the graphic display
device 1508, or output to an external device such as a printer, data storage unit, fabrication
system, or other peripheral component via a serial or parallel port 1512 if desired.

It will be recognized that while certain aspects of the invention are described in 
terms of a specific sequence of steps of a method, these descriptions are only illustrative of 
the broader methods of the invention, and may be modified as required by the particular 
application. Certain steps may be rendered unnecessary or optional under certain 
circumstances. Additionally, certain steps or functionality may be added to the disclosed
embodiments, or the order of performance of two or more steps permuted. All such
variations are considered to be encompassed within the invention disclosed and claimed
herein. 

While the above detailed description has shown, described, and pointed out novel
features of the invention as applied to various embodiments, it will be understood that
various omissions, substitutions, and changes in the form and details of the device or
process illustrated may be made by those skilled in the art without departing from the
invention. The foregoing description is of the best mode presently contemplated of carrying
out the invention. This description is in no way meant to be limiting, but rather should be
taken as illustrative of the general principles of the invention. The scope of the invention
should be determined with reference to the claims.

APPENDIX I - HDL DESCRIPTION

This file has been redacted 

— Confidential Information 

Limited Distribution to Authorized Persons Only 
Created 1996 and Protected as an Unpublished Work 
Under the U.S. Copyright Act of 1976. 
Copyright © 1996 - 2001 ARC CORES LTD.
All Rights Reserved. 
-- Inputs and Outputs:
--
-- L indicates a latched signal, U indicates a signal produced by logic.
--
-- Stage 1 - Opcode fetch
--
-- in  p1iw[31:0]  U  The instruction word supplied by the memory controller.
--                    It is considered to be valid when the ivalid signal is
--                    true.
--
-- in  ivalid      U  Qualifying signal for p1iw[31:0]. When it is low, this
--                    indicates that the m/c has not been able to fetch the
--                    requested opcode, and that the program counter should
--                    not be incremented. The pipeline might be stalled,
--                    depending upon whether the instruction in stage 2 needs
--                    to look at the instruction in stage 1.
--                    When it is true, the instruction is clocked into
--                    pipeline stage 2 provided that the pipeline is able to
--                    move on.
--
-- in  ivic        U  Indicates that all values in the cache are to be
--                    invalidated. (It stands for Invalidate Instruction
--                    Cache).
--                    It is anticipated that this signal will be generated
--                    from a decode of an SR instruction.
--                    Note that due to the pipelined nature of the ARC, up to
--                    three instructions could be issued following the SR
--                    which generates the ivic signal. Cache invalidates must
--                    be suppressed when a line is being loaded from memory.
--                    This is done at the auxiliary register which generates
--                    ivic.
--
-- out pcen        U  Program counter enable. When this signal is true, the pc
--                    will change at the end of the cycle, indicating that the
--                    memory controller needs to do a fetch on the next cycle
--                    using the address which will appear on currentpc[], which
--                    is supplied from aux_regs.vhd.
--                    This signal is affected by interrupt logic and all other
--                    pipeline stage enables.

U This signal, similar to pcen, indicates to the 

controller that a- new instruction is required, and 

be fetched from memory from the address which will be 
clocked into currentpc [25 : 2] at the end of the cycle. 

is also true for one cycle when the processor has 

started following a reset, in order to get the ball 
rolling. 

An instruction fetch will also be issued if the host 
changes the program counter when the ARC is halted, 
provided it is not directly after a reset. 

The ifetch signal will never be set true whilst the 
memory controller is in the process of doing an 
instruction fetch, so it may be used by the memory 
controller as an acknowledgement of instruction 

U This signal is true when an instruction fetch has 

issued, and it has not yet completed. It is not true 
directly after a reset before the ARC has started, as 

instruction fetch will have been issued. It is used 

hold off host writes to the program counter when the 

is halted, as these accesses will trigger an 

fetch. 

U indicates that an interrupt has been detected, and 

interrupt-op will be inserted into stage 2 on the 

cycle, (subject to pipeline enables) setting p2int 

This signal will have the effect of canceling the 
instruction currently being fetched by stage 1 by 

p2iv to be set false at the end of the cycle when 
is true. 

U Stage 2 pipeline latch control. True when an 

is being latched into pipeline stage 2. Will be true 
at different times to pcen, as it allows junk 

to be latched into the pipeline. 
A feature of this signal is that it will allow an
instruction to be clocked into stage 2 even when stage 3
is halted, provided that stage 2 contains a killed
instruction (i.e. p2iv = '0'). This is called a
'catch-up'.



Stage 2 - Operand fetch 
-- out en2              U  Pipeline stage 2 enable. When this signal is true, the
--                         instruction in stage 2 can pass into stage 3 at the end
--                         of the cycle. When it is false, it will hold up stage 2
--                         and stage 1 (pcen).
--
-- out p2i[4:0]         L  Opcode word. This bus contains the instruction word
--                         which is being executed by stage 2. It must be
--                         qualified by p2iv.
--
-- out p2iv             L  Opcode valid. This signal is used to indicate that the
--                         opcode in pipeline stage 2 is a valid instruction. The
--                         instruction may not be valid if a junk instruction has
--                         been allowed to come into the pipeline in order to allow
--                         the pipeline to continue running when an instruction
--                         cannot be fetched by the memory controller.
--
-- out fs1a[5:0]        L  Source 1 register address. This is the B field from the
--                         instruction word, sent to the core registers (via
--                         hostif) and the LSU. It is qualified for LSU use by
--                         s1en.
--
-- out s2a[5:0]         L  Source 2 register address. This is the C field from the
--                         instruction word, sent to the core registers and the
--                         LSU. It is qualified for LSU use by s2en.
--
-- out dest[5:0]        L  Destination register address. This is the A field from
--                         the instruction word, sent to the LSU for register
--                         scoreboarding of loads. It is qualified by the desten
--                         signal.
--
-- out s1en             U  This signal is used to indicate to the LSU that the
--                         instruction in pipeline stage 2 will use the data from
--                         the register specified by fs1a[5:0]. If the signal is
--                         not true, the LSU will ignore fs1a[5:0]. This signal
--                         includes p2iv as part of its decode.
--
-- out s2en             U  This signal is used to indicate to the LSU that the
--                         instruction in pipeline stage 2 will use the data from
--                         the register specified by s2a[5:0]. If the signal is
--                         not true, the LSU will ignore s2a[5:0]. This signal
--                         includes p2iv as part of its decode.
--
-- in  xholdup12        U  From extensions. This signal is used to hold up pipeline
--                         stages 1 and 2 (pcen, en1 and en2) when extension logic
--                         requires that stage 2 be held up. For example, a core
--                         register is being used as a window into SRAM, and the
--                         SRAM is not available on this cycle, as a write is
--                         taking place from stage 4, the writeback stage. Hence
--                         stage 2 must be held to allow the write to complete
--                         before the load can happen.
--                         Stages 3 and 4 will continue running.
--
-- out desten           U  This signal is used to indicate to the LSU that the
--                         instruction in pipeline stage 2 will use the data from
--                         the register specified by dest[5:0]. If the signal is
--                         not true, the LSU will ignore dest[5:0]. This signal
--                         includes p2iv as part of its decode.
--
-- out p2offset[19:0]   L  This bus carries the region of the instruction which
--                         contains the branch offset. It is used by the program
--                         counter generation logic when the instruction in stage 2
--                         is a Bcc/BLcc or LPcc.
--
-- out p2condtrue       U  This signal is produced from the result of the internal
--                         stage 2 condition code unit or from an extension cc unit
--                         (if implemented). A bit (bit 5) in the instruction
--                         selects between the internal and extension cc unit
--                         results.
--                         As stage 2 conditionals are only used by branch and jump
--                         instructions, the logic to produce this signal is
--                         simpler than that required from p3condtrue, which takes
--                         into account the complications presented by short
--                         immediate data registers, amongst other things. When
--                         using p2condtrue, a decode for a branch/jump instruction
--                         must always be included along with a check for
--                         p2iv = '1'.



-- out p2setflags       L  This is bit 8 from the instruction word at stage 2,
--                         i.e. the .F or set flags bit used in the jump
--                         instruction.
--                         It is used in flags.vhd to determine whether the flags
--                         should be loaded by a jump instruction. The stage 3
--                         signal p3setflags is much more complicated, having to
--                         take into account the complications presented by

short 
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— out p2jblcc 



which 



out p2st 

from 



out p21do 
which 



30 which 

imm 
35 used. 



— out p21r 
which 

register 

instruction - 

passed 
register) 

read 



— out mload2 
valid 

decode 

signal. 



immediate data, amongst other things. 

L This signal indicates that an interrupt jximp 

(fantasy instruction) is currently in stage 2. This 

has a number of consequences throughout the system, 
causing the interrupt vector (int_vec [25 : 2] ) to be 

into the PC, and causing the old PC to be placed into 
the pipeline in order to be stored into the 

interrupt link register. 

Note that p2int and p2iv are mutually exclusive. 

U True when a JLcc or BLcc instruction is in stage 2. 
Does not include p2iv. 
Used in conjunction with the branch delay slot mode 

is re-created from the short immediate field. 

U This signal is used by coreregs . vhd. It is produced 

a decode of p2i[4:0], p2iw(25) (check for SR) and 
does not include p2iv. 

U True when p2i[4:0] = oldo, and p2iw(13) = '0', 

indicates that the instruction is an LDO, not an LR 

is an encoding of the LDO instruction. 

This signal is used by coreregs. vhd to' switch short 

data onto a source bus when an LDO instruction is 

Does not include p2iv. 

U True when p2i[4:0] = oldo, and p2iw(13) =. '1', 

indicates that the instruction is the auxiliary 

load instruction LR, not a memory load LDO 

This signal is used by coreregs. vhd to switch the 
currentpc bus onto the source2 bus (which is then 

through the same logic as the interrupt link 

in order to get the correct value of pc when it is 

by an LR instruction. 
Does not include p2iv. 

U This signal indicates to the LSD that there is a 
load instruction in stage 2. It is produced from a 
of p2i[4:0], p2iw(13) (to exclude LR) and the p2iv 
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— in holdupl2 
pipeline 

unit 

stage 



0 This signal indicates to the actionpoint mechanism 

selected that there is a valid store instruction in 

2. It is produced from a decode of p2i[4:0], p2iw(13) 
(to exclude SR) and the p2iv signal. 

U From Isu.vhd. This signal is used to hold up 

stages 1 and 2 (pcen and en2) when the load store 

finds a register being used by the instruction at 

2 which is the destination of a delayed load. It will 
also be set when the scoreboard unit is full and the 
" ARC attempts to do another load. Stages 3 and 4 will 

will continue running. 

" in aluf lags [3:0] L ALO flags, direct from the latches in flags.vhd 

0 From extensions. Indicates that the register 

by fsla[5:01 is not available for shortcutting. This 

should only be set true when the register in question 

an extension core register. This signal is ignored 

constant xt_corereg is set true. 

O From extensions. Indicates that the register 

by s2a[5:0] is not available for shortcutting. This 

should only be set true when the register in question 

an extension core register. This signal is ignored 

constant xt_corereg is set true. 

0 True when a relative branch {not j\amp) is going to 
Relates to the instruction in p2. Includes p2iv. 



— in x_p2noscl 
referenced 

signal 

is 

unless 

— in x_p2nosc2 
referenced 

signal 

is 

unless 

— out dorel 
happen . 



45 — out dojcc 



U True when a jump is going to happen. 

Relates to the instruction in p2. Includes p2iv, 



50 



55 



-- out p2killnext U True when the instruction in stage 2 is a 

branch/jump type ^^^^^^^^^ ^^^^^ ^.^i ^ill the following delay slot 
instruction. following operation will be marked invalid when 

1^ passed from stage 1 into stage 2. 
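
The stage 2 load decodes described above (p2ldo, p2lr, mload2) can be sketched as a small behavioural model. This is an illustrative Python model rather than the VHDL source: the value given to `OLDO` and the helper names are assumptions made for the example, and only the LDO/LR opcode is modelled (any other load opcode is outside this sketch).

```python
# Behavioural sketch of the stage 2 load decodes described in the text.
# OLDO is a placeholder value; the real encoding of "oldo" is not stated here.
OLDO = 0x01


def bit(word: int, pos: int) -> int:
    """Return bit `pos` of `word` as 0 or 1."""
    return (word >> pos) & 1


def decode_stage2_loads(p2i: int, p2iw: int, p2iv: bool) -> dict:
    """Model p2ldo, p2lr and mload2 from opcode, instruction word and valid bit."""
    is_oldo = (p2i & 0x1F) == OLDO            # compare the 5-bit opcode p2i[4:0]
    p2ldo = is_oldo and bit(p2iw, 13) == 0    # memory load (LDO)
    p2lr = is_oldo and bit(p2iw, 13) == 1     # auxiliary register load (LR)
    mload2 = p2ldo and p2iv                   # only mload2 includes p2iv
    return {"p2ldo": p2ldo, "p2lr": p2lr, "mload2": mload2}
```

Note that p2ldo and p2lr stay asserted for an invalid instruction, matching the "Does not include p2iv" remarks, whereas mload2 is qualified by p2iv.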



Stage 3 - ALU

-- out en3            U Pipeline stage 3 enable. When this signal is true, the
--                      instruction in stage 3 can pass into stage 4 at the end
--                      of the cycle. When it is false, it will probably hold up
--                      stages one (pcen), two (en2), and three.
--
-- out p3i[4:0]       L Opcode word. This bus contains the instruction word
--                      which is being executed by stage 3. It must be
--                      qualified by p3iv.
--
-- out p3a[5:0]       L Instruction A field. This bus carries the region of
--                      the instruction which contains the operand dest field.
--
-- out p3c[5:0]       L Instruction C field. This bus carries the region of
--                      the instruction which contains the operand C field. This
--                      is used to encode extra single-operand functions onto
--                      the FLAG instruction opcode.
--
-- out p3iv           L Opcode valid. This signal is used to indicate that the
--                      opcode in pipeline stage 3 is a valid instruction. The
--                      instruction may not be valid if a junk instruction has
--                      been allowed to come into the pipeline in order to allow
--                      the pipeline to continue running when an instruction
--                      cannot be fetched by the memory controller, or when an
--                      instruction has been killed.
--
-- in  p3int          U This signal indicates that an interrupt jump instruction
--                      (fantasy instruction) is currently in stage 3. This
--                      signal causes (in conjunction with p3ilev1) the
--                      appropriate interrupt mask bits to be cleared in the
--                      status register.
--                      Note that p3int and p3iv are mutually exclusive.
--
-- in  p3ilev1        U This is used in conjunction with p3int to indicate which
--                      level of interrupt is being processed, and hence which
--                      of the interrupt mask bits should be cleared.
--                      It comes from bit 7 of the jump instruction word, which
--                      is set when a 'level1' (lowest level) interrupt is being
--                      processed.
--
-- out p3condtrue     U This signal is produced from the result of the internal
--                      stage 3 condition code unit or from an extension cc unit
--                      (if implemented). A bit (bit 5) in the instruction
--                      selects between the internal and extension cc unit
--                      results. In addition, this signal is set true if the
--                      instruction is using short immediate data. As it is only
--                      used by flags.vhd in conjunction with the p3i=oflag, and
--                      with p3setflags, it does not include a decode for
--                      instructions which do not have a condition code field
--                      (i.e. all load and store operations).
--                      Does not include p3iv.
--
-- out p3setflags     U This signal is used by regular alu-type instructions and
--                      the jump instruction to control whether the supplied
--                      flags get stored. It is produced from the set-flags bit
--                      in the instruction word, but if that field is not
--                      present in the instruction (e.g. short immediate data is
--                      being used) then it will either come from the set-flag
--                      modes implied by which short immediate data register is
--                      used, or it will be set false if the instruction does
--                      not affect the flags.
--                      Does not include p3iv.
--
-- out p3cc[3:0]      L This bus contains the region of the instruction which
--                      contains the four-bit condition code field. It is sent
--                      with the alu flags to the extension condition code test
--                      logic which provides in return a signal (xp3ccmatch)
--                      which indicates whether it considers the condition to be
--                      true. The ARC decides whether to use the internal
--                      condition-true signal or the signal provided by
--                      extensions depending on the fifth bit of the
--                      instruction. This is handled within rctl.vhd.
--
-- in  xp3ccmatch     U This signal is provided by an extension condition-code
--                      unit which takes the condition code field from the
--                      instruction (at stage 3), and the alu flags (from stage
--                      3), performs some operation on them and produces this
--                      condition true signal. Another bit in the instruction
--                      word indicates to the ARC whether it should use the
--                      internal condition-true signal or the one provided by
--                      the extension logic. This technique will allow extra ALU
--                      instruction conditions to be added which may be specific
--                      to different implementations of the ARC.
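
The selection between the internal and extension condition-code results described for p3condtrue can be sketched as follows. This is a minimal Python model under stated assumptions: the argument names `internal_cc_true`, `use_extension_cc` (standing for the selecting instruction bit) and `uses_short_imm` are hypothetical labels for the example, and the internal condition test itself is taken as an input rather than modelled.

```python
def p3condtrue(internal_cc_true: bool, xp3ccmatch: bool,
               use_extension_cc: bool, uses_short_imm: bool) -> bool:
    """Sketch of p3condtrue: pick the cc result, forced true for short immediates."""
    if uses_short_imm:
        # The text states the signal is set true when short immediate data is used.
        return True
    # The selecting instruction bit chooses the extension or the internal result.
    return xp3ccmatch if use_extension_cc else internal_cc_true
```

As in the description, this sketch does not include p3iv; a consumer would qualify the result with the stage 3 valid bit separately.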
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out sc_reg2 
-- out sc_reg1        U This signal is produced by the pipeline control unit
--                      rctl, and is set true when an instruction in stage 3 is
--                      going to generate a write to the register being read by
--                      source 1 of the instruction in stage 2. This is a source
--                      1 shortcut.
--                      It is used by the core register module to switch the
--                      stage 3 result bus onto the stage 2 source 1 result.
--                      Extension core registers can have shortcutting banned
--                      if x_p2nosc1 is set true at the appropriate time.
--                      Includes both p2iv and p3iv.
--                      The lasts1 signal is sc_reg1 and sc_load1 ORed together.
--
-- out sc_load1       U This signal is set true when data from a returning load
--                      is required to be shortcut onto the stage 2 source 1
--                      bus. This will only be the case if fast-load-returns are
--                      enabled, or if a four-port register file is used. If the
--                      4p register file is implemented, the data used for the
--                      shortcut comes direct from the memory system, this
--                      requiring an additional input into the shortcut muxer.
--                      Extension core registers can have shortcutting banned
--                      if x_p2nosc1 is set true at the appropriate time.
--                      Includes both p2iv and p3iv.
--                      The lasts1 signal is sc_reg1 and sc_load1 ORed together.
--
-- out sc_reg2        U This signal is produced by the pipeline control unit
--                      rctl, and is set true when an instruction in stage 3 is
--                      going to generate a write to the register being read by
--                      source 2 of the instruction in stage 2. This is a source
--                      2 shortcut.
--                      It is used by the core register module to switch the
--                      stage 3 result bus onto the stage 2 source 2 result.
--                      Extension core registers can have shortcutting banned
--                      if x_p2nosc2 is set true at the appropriate time.
--                      Includes both p2iv and p3iv.
--                      The lasts2 signal is sc_reg2 and sc_load2 ORed together.
--
-- out sc_load2       U This signal is set true when data from a returning load
--                      is required to be shortcut onto the stage 2 source 2
--                      bus. This will only be the case if fast-load-returns are
--                      enabled, or if a four-port register file is used. If the
--                      4p register file is implemented, the data used for the
--                      shortcut comes direct from the memory system, this
--                      requiring an additional input into the shortcut muxer.
--                      Extension core registers can have shortcutting banned
--                      if x_p2nosc2 is set true at the appropriate time.
--                      Includes both p2iv and p3iv.
--                      The lasts2 signal is sc_reg2 and sc_load2 ORed together.
--
-- out p3dolink       L This signal is latched (with en2) from p2dolink, which
--                      is true when a JLcc or branch-and-link instruction was
--                      taken, indicating that the link register needs to be
--                      stored. It is used by alu.vhd to switch the program
--                      counter which has been passed down the pipeline onto the
--                      bus. If this signal is to be used to give a fully
--                      qualified indication that a J/BLcc is in stage 3, it
--                      must be qualified with p3iv to take account of pipeline
--                      tearing between stages 2 and 3 which could cause the
--                      instruction in stage three to be repeated.
--
-- out p3lr           U This signal is used by hostif.vhd. It is produced from
--                      a decode of p3i[4:0], p3iw(13) (check for LR) and
--                      includes p3iv. Also used in extension logic for separate
--                      decoding of auxiliary accesses from host and ARC.
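
The register shortcut (bypass) conditions described for sc_reg1, sc_reg2, lasts1 and lasts2 can be sketched as a behavioural model. This is an illustrative Python sketch, not the rctl implementation: the fields `p3_writes` and `p3_dest` are assumed stand-ins for "stage 3 is going to generate a write" and its destination address, and the load shortcut signals are taken as inputs.

```python
from dataclasses import dataclass


@dataclass
class ShortcutInputs:
    p2iv: bool                 # stage 2 instruction valid
    p3iv: bool                 # stage 3 instruction valid
    p3_writes: bool            # assumption: stage 3 will write a core register
    p3_dest: int               # assumption: stage 3 destination register address
    fs1a: int                  # stage 2 source 1 register address
    s2a: int                   # stage 2 source 2 register address
    x_p2nosc1: bool = False    # extension bans shortcutting onto source 1
    x_p2nosc2: bool = False    # extension bans shortcutting onto source 2
    sc_load1: bool = False     # returning load shortcut onto source 1
    sc_load2: bool = False     # returning load shortcut onto source 2


def shortcut_signals(s: ShortcutInputs) -> dict:
    """Return sc_reg1/sc_reg2 and the ORed lasts1/lasts2 signals."""
    writing = s.p2iv and s.p3iv and s.p3_writes   # includes both valid bits
    sc_reg1 = writing and s.p3_dest == s.fs1a and not s.x_p2nosc1
    sc_reg2 = writing and s.p3_dest == s.s2a and not s.x_p2nosc2
    return {
        "sc_reg1": sc_reg1,
        "sc_reg2": sc_reg2,
        "lasts1": sc_reg1 or s.sc_load1,
        "lasts2": sc_reg2 or s.sc_load2,
    }
```

For example, a stage 3 write to register 5 while stage 2 reads register 5 as source 1 asserts sc_reg1 and lasts1 only, unless x_p2nosc1 bans the shortcut.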
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— out mstore 
-- out p3sr           U This signal is used by hostif.vhd. It is produced from
--                      a decode of p3i[4:0], p3iw(25) (check for SR) and
--                      includes p3iv. Also used in extension logic for separate
--                      decoding of auxiliary accesses from host and ARC.
--
-- out mload          U This signal indicates to the LSU that there is a valid
--                      load instruction in stage 3. It is produced from a
--                      decode of p3i[4:0], p3iw(13) (to exclude LR) and the
--                      p3iv signal.
--
-- out mstore         U This signal indicates to the LSU that there is a valid
--                      store instruction in stage 3. It is produced from a
--                      decode of p3i[4:0], p3iw(25) (to exclude SR) and the
--                      p3iv signal.
--
-- out size[1:0]      L This pair of signals are used to indicate to the LSU
--                      the size of the memory transaction which is being
--                      requested by a LD or ST instruction. It is produced
--                      during stage 2 and latched as the size information bits
--                      are encoded in different places on the LD and ST
--                      instructions. It must be qualified by the mload/mstore
--                      signals as it does not include an opcode decode.
--
-- out sex            L This signal is used to indicate to the LSU whether
--                      a sign-extended load is required. It is produced during
--                      stage 2 and latched as the sign-extend bit in the two
--                      versions of the LD instruction (LDO/LDR) are in
--                      different places in the instruction word.
--
-- out nocache        L This signal is used to indicate to the LSU whether
--                      the load/store operation is required to bypass the
--                      cache. It comes from bit 5 of the ld/st control group
--                      which is found in different places in the ldo/ldr/st
--                      instructions.
--
-- out ldvalid_wb     U This signal is used to control the switching of
--                      returning load data onto the writeback path for the
--                      register file. It is set true whenever returning load
--                      data must pass through the regular load writeback path -
--                      this will be loads to r32-r60 for a 4p regfile system,
--                      or loads to r0-r60 for a 3p regfile system.
--
-- in  ldvalid        U From LSU. This signal is set true by the LSU to
--                      indicate that a delayed load writeback WILL occur on
--                      the next cycle. If the instruction in stage 3 wishes to
--                      perform a writeback, then pipeline stage 1, 2 and 3 will
--                      be held. If the instruction in stage 3 is invalid, or
--                      does not want to write a value into the core register
--                      set for some reason, and fast-load-returns are enabled,
--                      then the instructions in stages 1 and 2 will move into
--                      2 and 3 respectively, and the instruction that was in
--                      stage 3 will be replaced in stage 4 by the delayed load
--                      writeback.
--                      Note that delayed load writebacks WILL complete, even
--                      if the processor is halted (en=0). In this instance, the
--                      host may be held off for a cycle (hold_host) if it is
--                      attempting to access the core registers. **
--
-- in  regadr[5:0]    U From LSU. This bus carries the address of the register
--                      into which the delayed load will writeback when ldvalid
--                      is true. rctl.vhd will ensure that this value is latched
--                      onto wba[5:0] at the end of a cycle when ldvalid is
--                      true, even cycles when the processor is halted (en = 0).
--
-- in  mwait          U From MC. This signal is set true by the MC in order
--                      to hold up stages 1, 2, and 3. It is used when the
--                      memory controller cannot service a request for a memory
--                      access which is being made by the LSU. It will be
--                      produced from mload, mstore and logic internal to the
--                      memory controller.
--
-- in  xshimm         U From extensions. Indicates that an extension instruction
--                      in stage 3 is using short-immediate data other than that
--                      implied by the use of one of the short-immediate data
--                      registers. It is used by rctl to ensure correct values
--                      for p3condtrue and p3setflags are generated. Qualified
--                      by x_idecode3, xt_aluop and p3iv (eventually).
--
-- in  xholdup123     U From extensions. This is used by extension ALU
--                      instructions to hold up the pipeline if the function
--                      requested cannot be completed on the current cycle.
--                      Pipeline stages 1, 2 and 3 will typically be held, but
--                      the writeback (stage 4) will continue.
--
-- in  x_idecode3     L From extensions. This signal will be true when the
--                      extension logic detects an extension instruction in
--                      stage 3. It is latched from x_idecode2 by the extensions
--                      when en2 is true at the end of a cycle.
--                      It is used to correctly generate p3condtrue,
--                      and to detect (along with xnwb) when a register
--                      writeback will take place.
--
-- in  xnwb           U From extensions. Extension instructions utilise the
--                      normal writeback-control logic (ins.cc's, dest=imm,
--                      short imm data etc), but in addition have extra control.
--                      When the extension logic has 'claimed' an instruction
--                      in stage 3 by setting x_idecode3, it can also disable
--                      writeback for that instruction by setting xnwb. When
--                      x_idecode3 is low, or if the instruction is 'claimed' by
--                      the ARC, xnwb has no effect.
--
-- in  x_ialusel      U From extensions. This signal is provided to allow
--                      extension instructions to utilise basecase ALU
--                      operations for their own purposes. This is intended to
--                      be used to load up fifo command buffers for pixel
--                      engines etc.
--                      The ALU only decodes the bottom four bits of the
--                      instruction opcode directly, and has extra logic to take
--                      account of the x_ialusel signal.
--                      If extensions want to shadow an ALU operation, the i4
--                      bit is set (ie an extension instruction), whilst the
--                      rest of the instruction is set up as per the basecase
--                      ALU instruction. The ALU result mux will select an
--                      internal ALU result if i4 = 0 (basecase instruction), or
--                      if i4 = 1 (extension), x_ialusel = 1 (use int. result),
--                      x_idecode3 = 1 (valid extension instruction). The
--                      extension logic should also set xnwb to prevent
--                      writeback to the core register set. Flag setting will
--                      work normally unless the xsetflags signal is set, in
--                      which case the flags will be loaded from the xflags[3:0]
--                      bus.
--                      xp2idest should be set when the instruction is in stage
--                      2 to prevent the scoreboard unit from checking the dest
--                      register field.
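
The ALU result mux rule and the flag source rule given in the x_ialusel description can be written out as a short behavioural sketch. This is a minimal Python model under stated assumptions: the function names are invented for the example, and signals are modelled as plain integers (0 or 1) and tuples.

```python
def use_internal_alu_result(i4: int, x_ialusel: int, x_idecode3: int) -> bool:
    """True when the result mux selects the internal (basecase) ALU result."""
    if i4 == 0:
        return True                      # basecase instruction
    # Extension instruction shadowing an ALU operation: needs both signals set.
    return x_ialusel == 1 and x_idecode3 == 1


def flags_to_load(alu_flags: tuple, xflags: tuple, xsetflags: int) -> tuple:
    """Flags are taken from xflags[3:0] only when xsetflags is set."""
    return xflags if xsetflags == 1 else alu_flags
```

In this sketch an extension instruction that sets x_ialusel but has not been claimed (x_idecode3 = 0) does not receive the internal ALU result.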
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in actionhalt This signal is set true when the 

--                      actionpoint (if selected) has been triggered by a valid
--                      condition. The ARC pipeline is halted and flushed when
--                      this signal is '1'.
--
-- out brk_inst       U To flags.vhd. This signals to the ARC that a breakpoint
--                      instruction has been detected in stage one of the
--                      pipeline. Hence, the halt bit in the flag register has
--                      to be updated in addition to the BH bit in the debug
--                      register. The pipeline is stalled when this signal is
--                      set to '1'.
--                      Note: The pipeline is flushed of instructions when the
--                      breakpoint instruction is detected, and it is important
--                      to disable each stage explicitly. A normal instruction
--                      in stage one will mean that instructions in stage two,
--                      three and four will be allowed to complete. However, an
--                      instruction in stage one which is in the delay slot of a
--                      branch, loop or jump instruction means that stage two
--                      has to be stalled as well. Therefore, only stages three
--                      and four will be allowed to complete.

r This signals to the ARC that the 
'pipeline has been flushed due to a breakpoint or 

instruction. If it was due to a breakpoint 

the ARC is halted via the 'en' bit, and the AH bit is 
set to '1' in the debug register. 

This is used by the actionpoint
debugging system when selected to qualify the value of
the PC at stage one of the pipeline. The limm data is
considered to be at the same address as the
instruction it is associated with, as far as the
debugger is concerned.
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Sleep Mode signals 



-- in  sleeping         This is the sleep mode flag ZZ in the debug register
--                      (bit 23). When it is true the ARC is stalled. This flag
--                      is set when the p2sleep_inst is true and cleared on
--                      restart or interrupt.
--
-- out p2sleep_inst     This signal is set when a sleep instruction has been
--                      decoded in pipeline stage 2. It is used to set the sleep
--                      mode flag ZZ (bit 23) in the debug register.

Instruction Step signals

-- in  do_inst_step     This signal is set when the single step flag (SS) and
--                      the instruction step flag (IS) in the debug register
--                      have been written to simultaneously through the host
--                      interface. It indicates that an instruction step is
--                      being performed. When the instruction step has finished
--                      this signal goes low.
--
-- out stop_step        This signal is set when the instruction step has
--                      finished.
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40 



ENTITY rctl IS
PORT(
    signal ck           : in  std_ulogic;   -- system clock
    signal clr          : in  std_ulogic;   -- system reset
    signal en           : in  std_ulogic;   -- system go

    --** Stage 1 **--

    signal p1iw         : in  std_ulogic_vector(31 downto 0);
    signal ivalid       : in  std_ulogic;
    signal ivic         : in  std_ulogic;
    signal pcen         : out std_ulogic;
    signal ifetch       : out std_ulogic;
    signal ipending     : out std_ulogic;
    signal en1          : out std_ulogic;
    signal p1int        : in  std_ulogic;

    --** Stage 2 **--

    signal en2          : out std_ulogic;
    signal p2i          : out std_ulogic_vector(4 downto 0);
    signal p2iv         : out std_ulogic;
    signal fs1a         : out std_ulogic_vector(5 downto 0);
    signal s2a          : out std_ulogic_vector(5 downto 0);
    signal dest         : out std_ulogic_vector(5 downto 0);
    signal s1en         : out std_ulogic;
    signal s2en         : out std_ulogic;
    signal xholdup12    : in  std_ulogic;
    signal desten       : out std_ulogic;
    signal xp2idest     : in  std_ulogic;
    signal x_idecode2   : in  std_ulogic;
    signal p2shimm      : out std_ulogic_vector(8 downto 0);
    signal p2offset     : out std_ulogic_vector(19 downto 0);
    signal p2cc         : out std_ulogic_vector(3 downto 0);
    signal xp2ccmatch   : in  std_ulogic;
    signal p2condtrue   : out std_ulogic;
    signal p2setflags   : out std_ulogic;
    signal p2int        : in  std_ulogic;
    signal p2jblcc      : out std_ulogic;
    signal p2st         : out std_ulogic;
    signal p2ldo        : out std_ulogic;
    signal p2lr         : out std_ulogic;
    signal mload2       : out std_ulogic;
    signal mstore2      : out std_ulogic;
    signal holdup12     : in  std_ulogic;
    signal aluflags     : in  std_ulogic_vector(3 downto 0);
    signal x_p2nosc1    : in  std_ulogic;
    signal x_p2nosc2    : in  std_ulogic;
    signal dorel        : out std_ulogic;
    signal dojcc        : out std_ulogic;
    signal p2killnext   : out std_ulogic;



    --** Stage 3 **--

    signal en3          : out std_ulogic;
    signal p3i          : out std_ulogic_vector(4 downto 0);
    signal p3a          : out std_ulogic_vector(5 downto 0);
    signal p3c          : out std_ulogic_vector(5 downto 0);
    signal p3iv         : out std_ulogic;
    signal p3int        : in  std_ulogic;
    signal p3ilev1      : in  std_ulogic;
    signal p3condtrue   : out std_ulogic;
    signal p3setflags   : out std_ulogic;
    signal p3cc         : out std_ulogic_vector(3 downto 0);
    signal xp3ccmatch   : in  std_ulogic;
    signal lasts1       : out std_ulogic;
    signal lasts2       : out std_ulogic;
    signal sc_reg1      : out std_ulogic;
    signal sc_reg2      : out std_ulogic;
    signal sc_load1     : out std_ulogic;
    signal sc_load2     : out std_ulogic;
    signal p3dolink     : out std_ulogic;
    signal p3lr         : out std_ulogic;
    signal p3sr         : out std_ulogic;
    signal mload        : out std_ulogic;
    signal mstore       : out std_ulogic;
    signal size         : out std_ulogic_vector(1 downto 0);
    signal sex          : out std_ulogic;
    signal nocache      : out std_ulogic;
    signal ldvalid      : in  std_ulogic;
    signal regadr       : in  std_ulogic_vector(5 downto 0);
    signal mwait        : in  std_ulogic;
    signal xshimm       : in  std_ulogic;
    signal xholdup123   : in  std_ulogic;
    signal x_idecode3   : in  std_ulogic;
    signal xnwb         : in  std_ulogic;
    signal p3wb_en      : out std_ulogic;
    signal p3wb_nxt     : out std_ulogic;
    signal p3wba        : out std_ulogic_vector(5 downto 0);
    signal p3_ni_wbrq   : out std_ulogic;
    signal ldvalid_wb   : out std_ulogic;



    --** Debug interface **--

    signal actionhalt      : in  std_ulogic;
    signal hw_brk_only     : in  std_ulogic;
    signal sleeping        : in  std_ulogic;
    signal do_inst_step    : in  std_ulogic;
    signal stop_step       : out std_ulogic;
    signal p2sleep_inst    : out std_ulogic;
    signal brk_inst        : out std_ulogic;
    signal p2limm          : out std_ulogic;
    signal AP_p3disable_r  : out std_ulogic;
    signal p2limm_data_r   : out std_ulogic_vector(31 downto 0);
    signal fetch_rollinger : in  std_ulogic;
    signal p2merge_valid_r : out std_ulogic);

END rctl;

ARCHITECTURE synthesis OF rctl IS



-- internal signals:

SIGNAL i_ifetch          : std_ulogic;
SIGNAL ipcen             : std_ulogic;
SIGNAL ien1              : std_ulogic;
SIGNAL ien1_lowpower     : std_ulogic;

-- for debugging and halting the pipeline stages

SIGNAL i_brk_decode      : std_ulogic;
SIGNAL i_brk_inst        : std_ulogic;
SIGNAL i_brk_pass        : std_ulogic;
SIGNAL i_kill_AP         : std_ulogic;
SIGNAL i_break_stage1    : std_ulogic;
SIGNAL i_break_stage2    : std_ulogic;
SIGNAL i_AP_p2disable_r  : std_ulogic;
SIGNAL i_AP_p3disable_r  : std_ulogic;
SIGNAL i_n_AP_p2disable  : std_ulogic;
SIGNAL i_n_AP_p3disable  : std_ulogic;
SIGNAL ip2sleep_inst     : std_ulogic;
signal istop_step        : std_ulogic;
signal inst_stepping     : std_ulogic;
signal p1p2step          : std_ulogic;
signal p2step            : std_ulogic;
signal p3step            : std_ulogic;
signal pcen_step         : std_ulogic;

SIGNAL ien2              : std_ulogic;
SIGNAL ip2iw             : std_ulogic_vector(31 downto 0);
SIGNAL ip2i              : std_ulogic_vector(4 downto 0);
SIGNAL ip2a              : std_ulogic_vector(5 downto 0);
SIGNAL ip2b              : std_ulogic_vector(5 downto 0);
SIGNAL ip2c              : std_ulogic_vector(5 downto 0);
SIGNAL ip2q              : std_ulogic_vector(4 downto 0);
SIGNAL ip2dd             : std_ulogic_vector(1 downto 0);
SIGNAL ip2ld             : std_ulogic;
SIGNAL ip2_fbit          : std_ulogic;
SIGNAL ip2iv             : std_ulogic;
SIGNAL ip2ccmatch        : std_ulogic;
SIGNAL ip2condtrue       : std_ulogic;
SIGNAL ishi_bf           : std_ulogic;
SIGNAL ishi_bn           : std_ulogic;
SIGNAL ishi_cf           : std_ulogic;
SIGNAL ishi_cn           : std_ulogic;
SIGNAL ip2shimm          : std_ulogic;
SIGNAL ip2shimmf         : std_ulogic;
SIGNAL is1en             : std_ulogic;
SIGNAL is2en             : std_ulogic;
SIGNAL idesten           : std_ulogic;
SIGNAL ip2jblcc          : std_ulogic;
SIGNAL ip2mop_e          : std_ulogic_vector(memop_esz downto 0);
SIGNAL ip2size           : std_ulogic_vector(1 downto 0);
SIGNAL ip2sex            : std_ulogic;
SIGNAL ip2awb            : std_ulogic;
SIGNAL ip2nocache        : std_ulogic;
SIGNAL ip2ldo            : std_ulogic;
SIGNAL ip2st             : std_ulogic;
SIGNAL ip2limm           : std_ulogic;
SIGNAL ip2bch            : std_ulogic;
SIGNAL ip2killnext       : std_ulogic;
SIGNAL ip2pldep          : std_ulogic;
SIGNAL ip2rjmp           : std_ulogic;
SIGNAL ip2jumping        : std_ulogic;
SIGNAL ip2nojump         : std_ulogic;
SIGNAL ip2lpcc           : std_ulogic;
SIGNAL ip2dolink         : std_ulogic;
SIGNAL ilasts1           : std_ulogic;
SIGNAL ilasts2           : std_ulogic;

SIGNAL ien3              : std_ulogic;
SIGNAL ip3i              : std_ulogic_vector(4 downto 0);
SIGNAL ip3a              : std_ulogic_vector(5 downto 0);
SIGNAL ip3b              : std_ulogic_vector(5 downto 0);
SIGNAL ip3c              : std_ulogic_vector(5 downto 0);
SIGNAL ip3q              : std_ulogic_vector(4 downto 0);
SIGNAL ip3_fbit          : std_ulogic;
SIGNAL ip3shimm          : std_ulogic;
SIGNAL ip3shimmf         : std_ulogic;
SIGNAL ip3iv             : std_ulogic;
SIGNAL ip3ccmatch        : std_ulogic;
SIGNAL ip3condtrue       : std_ulogic;
SIGNAL ip3setflags       : std_ulogic;
SIGNAL ip3size           : std_ulogic_vector(1 downto 0);
SIGNAL ip3sex            : std_ulogic;

SIGNAL ip3awb            : std_ulogic;
SIGNAL ip3nocache        : std_ulogic;
SIGNAL ip3wb_en          : std_ulogic;
SIGNAL ip3dolink         : std_ulogic;
SIGNAL ip3dlimm          : std_ulogic;
SIGNAL imload3           : std_ulogic;
SIGNAL imstore3          : std_ulogic;
SIGNAL ip3m_awb          : std_ulogic;
SIGNAL ip3ccwb_op        : std_ulogic;
SIGNAL ip3xwb_op         : std_ulogic;
SIGNAL ip3_wb_req        : std_ulogic;
SIGNAL ip3_wb_rsv        : std_ulogic;
SIGNAL ip3lr             : std_ulogic;
SIGNAL ip3sr             : std_ulogic;

SIGNAL ip3wba            : std_ulogic_vector(5 downto 0);
SIGNAL ip3_sc_wba        : std_ulogic_vector(5 downto 0);
SIGNAL iwben             : std_ulogic;

SIGNAL new_p2iw          : std_ulogic_vector(31 downto 0);
SIGNAL new_p3i           : std_ulogic_vector(opcodsz downto 0);
SIGNAL new_p3a           : std_ulogic_vector(oprandsz downto 0);
SIGNAL new_p3b           : std_ulogic_vector(oprandsz downto 0);
SIGNAL new_p3c           : std_ulogic_vector(oprandsz downto 0);
SIGNAL new_p3q           : std_ulogic_vector(qqsiz downto 0);
SIGNAL new_p3_fbit       : std_ulogic;
SIGNAL new_p3shimm       : std_ulogic;
SIGNAL new_p3shimmf      : std_ulogic;
SIGNAL new_p3size        : std_ulogic_vector(1 downto 0);
SIGNAL new_p3sex         : std_ulogic;
SIGNAL new_p3awb         : std_ulogic;
SIGNAL new_p3nocache     : std_ulogic;
SIGNAL new_p3dolink      : std_ulogic;
SIGNAL new_wba           : std_ulogic_vector(oprandsz downto 0);
SIGNAL iwba              : std_ulogic_vector(oprandsz downto 0);

SIGNAL n_p2iv            : std_ulogic;
SIGNAL n_p3iv            : std_ulogic;
SIGNAL i_awake           : std_ulogic;
SIGNAL l_go              : std_ulogic;
SIGNAL n_go              : std_ulogic;
SIGNAL ni_go             : std_ulogic;
SIGNAL i_hostload        : std_ulogic;
SIGNAL ip3wb_nxt         : std_ulogic;
SIGNAL ip2limm1          : std_ulogic;
SIGNAL ip2limm2          : std_ulogic;
SIGNAL ien3_non_iv       : std_ulogic;
SIGNAL ien2_non_iv       : std_ulogic;
SIGNAL ip3_load_stall    : std_ulogic;
SIGNAL isc_reg1          : std_ulogic;
SIGNAL isc_reg2          : std_ulogic;
SIGNAL isc_load1         : std_ulogic;
SIGNAL isc_load2         : std_ulogic;
SIGNAL ihp2_ld_nsc1      : std_ulogic;
SIGNAL ihp2_ld_nsc2      : std_ulogic;
SIGNAL ihp2_ld_nsc       : std_ulogic;

SIGNAL ibch_holdp2       : std_ulogic;
SIGNAL ibch_p3flagset    : std_ulogic;

SIGNAL ildvalid_wb       : std_ulogic;
signal ip2ivalid_r       : std_ulogic;
signal ip2limm_data_r    : std_ulogic_vector(31 downto 0);
signal i_p2merge_valid_r : std_ulogic;
signal i_fst_ifetch_r    : std_ulogic;
signal i_p2_fst_ifetch_r : std_ulogic;
signal i_fetchen         : std_ulogic;
signal i_pending_kill_r  : std_ulogic;
signal i_cancel_kill_r   : std_ulogic;
signal i_ifetch_r        : std_ulogic;
signal i_ipending        : std_ulogic;
signal i_p1_used_r       : std_ulogic;

BEGIN



-- New Outputs

p2limm_data_r   <= ip2limm_data_r;
p2merge_valid_r <= i_p2merge_valid_r;
ipending        <= i_ipending;

--** Stage 1 **--

merge_process : process (ck, clr)
begin
    if clr = '1' then

        ip2ivalid_r       <= '0';
        i_p2merge_valid_r <= '0';  -- PS
        ip2limm_data_r    <= (others => '0');
        i_fst_ifetch_r    <= '1';
        i_p2_fst_ifetch_r <= '1';
        i_pending_kill_r  <= '0';
        i_cancel_kill_r   <= '0';
        i_p1_used_r       <= '0';

    elsif (ck'EVENT and ck = '1') then

        -- Latch ivalid for use in stage 2
        ip2ivalid_r <= ivalid;

        -- Latch in long immediates when an instruction in stage 2
        -- references a long immediate and it is available in stage 1.
        -- Record that the long immediate is available and has been
        -- merged with the opcode.
        -- Indicate that the dataword in stage 1 has been used.
        if ivalid = '1' and ip2limm = '1' then
            ip2limm_data_r    <= p1iw;
            i_p1_used_r       <= '1';
            i_p2merge_valid_r <= '1';
        end if;

        -- Indicate that the dataword in stage 1 has been used.
        if ien1_lowpower = '1' then
            i_p1_used_r <= '1';
        end if;

        -- When a new instruction dataword is requested clear the
        -- p1_used flag
        if i_ifetch = '1' then
            i_p1_used_r <= '0';
        end if;

        -- When the instruction in stage 2 moves and it references a long
        -- immediate clear the p2merge valid flag.
        if ien2 = '1' and ip2limm = '1' then
            i_p2merge_valid_r <= '0';
        end if;

        -- Start up the pipeline so stage 1 can advance stage 0 when
        -- stage 0 is stalled or an ivic is requested.
        if ivic = '1' or i_ifetch = '0' then
            i_fst_ifetch_r    <= '1';
            i_p2_fst_ifetch_r <= '1';
        end if;

        -- Clear the ifetch advancement flag
        if i_ifetch = '1' then
            i_fst_ifetch_r <= '0';
        end if;

        -- Clear the ifetch advancement flag
        if i_fst_ifetch_r = '0' and ivic = '0' then
            i_p2_fst_ifetch_r <= '0';
        end if;

        -- Re-initialize the ifetch advancement flags
        -- since ifetching has been stalled
        if ivic = '1' or i_ifetch = '0' then
            i_fst_ifetch_r    <= '1';
            i_p2_fst_ifetch_r <= '1';
        end if;

        if ivalid = '0' and ip2killnext = '1' and ien2 = '0' then
            i_cancel_kill_r <= '1';
        end if;

        if ien2 = '1' then
            i_cancel_kill_r <= '0';
        end if;

        if ivalid = '0' and ip2killnext = '1' and ien2 = '1'
           and i_cancel_kill_r = '0' then
            i_pending_kill_r <= '1';
        end if;

        -- Clear the pending instruction kill flag when the instruction
        -- is killed
        if i_pending_kill_r = '1' and ivalid = '1' then
            i_pending_kill_r <= '0';
        end if;

        -- Not used
        i_ifetch_r <= i_ifetch;

    end if;
end process;
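
The handling of the cancel and pending kill flags in the clocked process above can be followed more easily with a small behavioural model. This is an illustrative Python sketch, not a replacement for the VHDL: the class and function names are assumptions, and the model relies on the usual VHDL signal semantics, where every condition reads the register values from before the clock edge and a later assignment in the process overrides an earlier one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KillState:
    cancel_kill: bool = False    # models i_cancel_kill_r
    pending_kill: bool = False   # models i_pending_kill_r


def clock_kill_flags(s: KillState, ivalid: bool, ip2killnext: bool,
                     ien2: bool) -> KillState:
    """Compute the flag values after one rising clock edge."""
    cancel = s.cancel_kill
    pending = s.pending_kill

    # Stage 2 wants to kill the next instruction but is itself held.
    if not ivalid and ip2killnext and not ien2:
        cancel = True
    # Stage 2 moves on: the cancel flag is cleared.
    if ien2:
        cancel = False
    # The instruction to be killed has not arrived yet: remember the kill.
    # Note the test uses the pre-edge cancel flag (s.cancel_kill).
    if not ivalid and ip2killnext and ien2 and not s.cancel_kill:
        pending = True
    # The awaited instruction arrived and has been killed.
    if s.pending_kill and ivalid:
        pending = False

    return KillState(cancel, pending)
```

Stepping the model with `ivalid` low, `ip2killnext` high and `ien2` high sets the pending flag; a following cycle with `ivalid` high clears it again.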



— ** stage 1 logic ** — 

— The breakpoint instruction is determined at stage 1 from: 



[1] Decode of pliw, 

[2] Instruction at stage 1 is valid, 

25 — [3] The instruction is not killed, 

[4] The instruction is not long immediate data, 

[5] There is no sleep instruction in stage 2. 

i_brk_decode <= '1' WHEN (pliw (instrubnd downto instrlbnd) = of lag) 
30 AND {pliw(copubnd downto coplbnd) = so_brk) 

AND (pliw(shimmlbnd) « '0') 
AND (ip2ivalid_r « '1') else 

•0'; 

35 i_brk__pass <= NOT {ip2killnext ) AND 

NOT(ip21imm) AND 
N0T{ip2sleep_inst) ; 



i_brk_inst <= i_brk_decode AND i_brk_pass; 

br k_inst <= ' 0 ' ; — i_br k^inst ; 

-- ** stage 2 ** --

-- The sleep instruction is determined at stage 2 from:
--
--   [1] Decode of p2iw,
--   [2] Instruction at stage 2 is valid.

ip2sleep_inst <= '1' WHEN (ip2iw(instrubnd downto instrlbnd) = oflag)
                      AND (ip2iw(copubnd downto coplbnd) = so_sleep)
                      AND (ip2iw(shimmlbnd) = '1')
                      AND (ip2iv = '1') ELSE
                 '0';

p2sleep_inst <= ip2sleep_inst;

-- The data to be used for input to stage 2 is latched here.
-- Clock the instruction presented by the memory controller when ien1 is
-- true. Also clock p2iv, which is ivalid clocked when ien1 is true.
-- This signal is used to indicate which instructions in the pipeline are
-- real and which are junk which is being allowed to flow through to keep
-- things running.

-- p2ins : pipe32 PORT MAP (ck, ien1_lowpower, clr, p1iw, ip2iw);

new_p2iw <= p1iw WHEN ien1_lowpower = '1' ELSE ip2iw;

p2ins : PROCESS (ck, clr)
BEGIN
  IF clr = '1' THEN
    ip2iw <= (others => '0');
  ELSIF (ck'EVENT AND ck = '1') THEN
    if ien1_lowpower = '1' and ip2limm = '0' then
    end if;
    ip2iw <= new_p2iw;
  END IF;
END PROCESS;
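
-- The comment above p2ins says that p2iv is "ivalid clocked when ien1 is
-- true", but the latch itself is not shown in this part of the listing.
-- The following is a minimal sketch of such a valid-bit register only.
-- The entity name p2iv_sketch and the reset value '0' are assumptions for
-- illustration, and the interrupt (p1int) and kill (p2killnext) handling
-- that also clears p2iv in the real core is deliberately not modeled.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-alone entity; not part of the original source.
entity p2iv_sketch is
  port (
    ck     : in  std_logic;   -- clock
    clr    : in  std_logic;   -- asynchronous clear
    ien1   : in  std_logic;   -- stage 1 -> 2 transition enable
    ivalid : in  std_logic;   -- instruction word from memory is valid
    ip2iv  : out std_logic    -- stage 2 instruction valid
  );
end entity p2iv_sketch;

architecture rtl of p2iv_sketch is
begin
  p2iv_latch : process (ck, clr)
  begin
    if clr = '1' then
      ip2iv <= '0';             -- assumed reset value: stage 2 holds junk
    elsif rising_edge(ck) then
      if ien1 = '1' then
        ip2iv <= ivalid;        -- capture validity alongside the opcode
      end if;                   -- otherwise hold the previous value
    end if;
  end process p2iv_latch;
end architecture rtl;
```

-- Keeping the valid bit in its own register, separate from the opcode latch,
-- is what lets junk words flow through the pipeline without taking effect.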



-- The various component parts of the instruction are extracted here to
-- internal signals.

ip2i     <= ip2iw(instrubnd downto instrlbnd);  -- opcode
ip2a     <= ip2iw(aopubnd downto aoplbnd);      -- a field
ip2b     <= ip2iw(bopubnd downto boplbnd);      -- b field
ip2c     <= ip2iw(copubnd downto coplbnd);      -- c field
ip2_fbit <= ip2iw(setflgpos);                   -- flag bit
ip2q     <= ip2iw(qqubnd downto qqlbnd);        -- q field
ip2dd    <= ip2iw(ddubnd downto ddlbnd);        -- delay slot mode



-- Output drives of signals direct from the stage 2 input latch.
-- (some more extraction takes place also)

p2i      <= ip2i;                               -- opcode
dest     <= ip2a;                               -- destination
fs1a     <= ip2b;                               -- source 1
s2a      <= ip2c;                               -- source 2
p2shimm  <= ip2iw(shimmubnd downto shimmlbnd);  -- short immediate
p2cc     <= ip2iw(ccubnd downto cclbnd);        -- CC field (no x bit)
p2offset <= ip2iw(targubnd downto targlbnd);    -- branch offset

-- Now some simple decodes from the opcode field are performed.
-- These are for files which do their own decode of the p2i[] field.

ip2jblcc <= '1' WHEN (ip2i = oblcc)                       -- branch and link
                  OR ((ip2i = ojcc) AND (ip2c(0) = '1'))  -- jump and link
            ELSE '0';

-- output drives --

p2jblcc <= ip2jblcc;

ip2st <= '1' WHEN (ip2i = ost) AND ip2iw(25) = '0' ELSE '0';  -- ST instruction

p2st <= ip2st;

mstore2 <= ip2st AND ip2iv;


-- The load instruction has two opcodes ldr (00) and ldo (01). The aux LR
-- instruction is encoded on the ldo instruction, so must be excluded when
-- producing a signal which indicates that a load instruction is in stage 2.

ip2ld <= '1' WHEN (ip2i = oldr)
               OR (ip2i = oldo AND ip2iw(13) = '0') ELSE
         '0';

mload2 <= ip2ld AND ip2iv;

p2lr <= '1' WHEN (ip2i = oldo) and ip2iw(13) = '1' ELSE '0';

ip2ldo <= '1' WHEN (ip2i = oldo) and ip2iw(13) = '0' ELSE '0';
p2ldo  <= ip2ldo;


-- output drives --

s1en   <= is1en;
s2en   <= is2en;
desten <= idesten;

-- Output drive --

p2condtrue <= ip2condtrue;

-- Stage 2 flag setting calculation --

-- p2setflags just comes from bit 8 of the instruction word. It is only
-- used in flags.vhd and is qualified there with a decode of p2i /= ojcc,
-- and a check for p2iv and p2condtrue.

p2setflags <= ip2_fbit;
-- Produce signals to pass down the pipeline which indicate to stage 3
-- which register fields include immediate data registers, qualified with
-- the s1en/s2en/desten signals.
-- Here four signals are produced, one for each of the combinations of
-- the two source fields and the two short immediate data registers
-- (i.e. set flags/don't set flags)

ishi_bf <= '1' WHEN (ip2b = rfshimm) and is1en = '1' ELSE '0';
ishi_bn <= '1' WHEN (ip2b = rnshimm) and is1en = '1' ELSE '0';
ishi_cf <= '1' WHEN (ip2c = rfshimm) and is2en = '1' ELSE '0';
ishi_cn <= '1' WHEN (ip2c = rnshimm) and is2en = '1' ELSE '0';

-- Now produce signals which indicate whether a short-imm field is present
-- at the bottom of the instruction, due to a register (ip2shimm),
-- and indicate whether the flags should be set or not (ip2shimmf).

ip2shimm <= ishi_bf OR ishi_bn OR ishi_cf OR ishi_cn;

ip2shimmf <= ishi_bf OR ishi_cf;

-- Now extract the extra encoding information used for loads and stores.
-- The signals are extracted and latched at the end of stage 2.

ip2mop_e <= ip2iw(ldo_eubnd downto ldo_elbnd) WHEN ip2ldo = '1'
            ELSE
            ip2iw(st_eubnd downto st_elbnd) WHEN ip2st = '1'
            ELSE
            ip2iw(ldr_eubnd downto ldr_elbnd);

ip2nocache <= ip2mop_e(ls_nc);
ip2size    <= ip2mop_e(ls_subnd downto ls_slbnd);
ip2sex     <= ip2mop_e(ls_ext);
ip2awb     <= ip2mop_e(ls_awbck);



-- Generate signals for pipeline control and interrupt control units --

-- p2limm  - this will be true when a valid instruction which uses long imm
--           data is in stage 2. Note that this signal will include p2iv as
--           it includes s1en/s2en.
-- p2bch   - this will be true when a jump instruction bcc/blcc/lpcc/jcc is
--           in stage 2. This also includes p2iv, but explicitly this time.
-- p2p1dep - this signal is used to indicate that the instruction at stage 2
--           requires that the next instruction be in stage 1 before it can
--           move off. This may be either to ensure correct delay slot
--           operation for a branch or to make sure that long immediate data
--           is fetched, and then killed before it can be processed as an
--           instruction.

ip2limm1 <= '1' WHEN ip2b = rlimm ELSE '0';
ip2limm2 <= '1' WHEN ip2c = rlimm ELSE '0';

ip2limm <= (ip2limm1 AND is1en) OR (ip2limm2 AND is2en);

ip2bch <= '1' WHEN ip2iv = '1' AND (ip2i = obcc OR ip2i = oblcc
                                    OR ip2i = olpcc OR ip2i = ojcc) ELSE
          '0';

ip2p1dep <= ip2limm OR ip2bch;

p2limm  <= ip2limm;
p2p1dep <= ip2p1dep;






-- ** Pipeline control unit ** --
--
-- ivalid      U From memory controller. Indicates that the instruction/data
--               word presented to the ARC on p1iw[31:0] is valid.
--
-- p1int       U Indicates that an interrupt has been detected, and an
--               interrupt-op will be inserted into stage 2 on the next
--               cycle, setting p2int true. This signal will have the
--               effect of canceling the instruction currently being
--               fetched by stage 1 by causing p2iv to be set false at the
--               end of the cycle when p1int is true.
--
-- p2int       L Indicates that an interrupt-op instruction is in
--               stage 2. This signal is used in coreregs.vhd to control
--               the placing of the pc onto a source bus for writing back
--               to the interrupt link registers, and by aux_regs to
--               insert the interrupt vector int_vec[] into the program
--               counter, thus requiring this file to set pcen true.
--
-- p2bch       U This signal indicates that the instruction in stage 2 is
--               a branch or jump instruction, and therefore requires that
--               the instruction following must be present in the delay slot
--               before it can move on.
--               (Simple decode of p2i[4:0], and does include p2iv)
--
-- p2limm      U This signal indicates that the instruction in stage 2 uses
--               long immediate data for one of the source operands. This
--               means that the instruction cannot complete until the correct
--               data word has been fetched into stage 1. When the
--               instruction does move out of stage 2, the data word is
--               marked as an invalid instruction before it gets into
--               stage 2. The data word has served its purpose by this point,
--               so it can be overwritten by another instruction if stage 3
--               is stalled, and stage 1 allowed to move on into stage 2 over
--               the top of the data.
--               This signal includes s1en/s2en and p2iv.



-- holdup12    U From lsu.vhd. This signal is used to hold up pipeline
--               stages 1 and 2 (pcen, en1 and en2) when the load store unit
--               finds a register being used by the instruction at stage
--               2 which is the destination of a delayed load. It will
--               also be set when the scoreboard unit is full and the
--               ARC attempts to do another load. Stages 3 and 4
--               will continue running.
--
-- xholdup12   U From extensions. This signal is used to hold up pipeline
--               stages 1 and 2 (pcen, en1 and en2) when extension logic
--               requires that stage 2 be held up. For example, a core
--               register is being used as a window into SRAM, and the SRAM
--               is not available on this cycle, as a write is taking place
--               from stage 4, the writeback stage. Hence stage 2 must be
--               held to allow the write to complete before the load can
--               happen. Stages 3 and 4 will continue running.
--
-- p2killnext  U This signal indicates that the delay slot mechanism of the
--               jump instruction currently in stage 2 is requesting that the
--               next instruction be killed before it gets into stage 2.
--               This signal is produced from a decode for a jump instruction
--               code, the condition-true signal, p2iv and the delay-slot
--               field in the instruction. This signal relies on the delay
--               slot instruction being present in stage 1 before stage 2 can
--               move on. This is handled elsewhere by this file.
--
-- ldvalid     U From LSU. This signal is set true by the LSU to
--               indicate that a delayed load writeback WILL occur on
--               the next cycle. If the instruction in stage 3 wishes to
--               perform a writeback, then pipeline stages 1, 2 and 3 will
--               be held. If the instruction in stage 3 is invalid, or
--               does not want to write a value into the core register
--               set for some reason, then the instructions in stages 1
--               and 2 will move into 2 and 3 respectively, and the
--               instruction that was in stage 3 will be replaced in
--               stage 4 by the delayed load writeback.
--
--               Note that delayed load writebacks WILL complete, even
--               if the processor is halted (en=0). In this instance, the
--               host may be held off for a cycle (hold_host) if it is
--               attempting to access the core registers.
--
-- mwait       U From MC. This signal is set true by the MC in order
--               to hold up stages 1, 2, and 3. It is used when the
--               memory controller cannot service a request for a memory
--               access which is being made by the LSU. It will be
--               produced from mload3, mstore3 and logic internal to the
--               memory controller.



-- mload3      U This signal indicates to the LSU that there is a valid
--               load instruction in stage 3. It is produced from a decode
--               of p3i[4:0], p3iw(13) (to exclude LR) and the p3iv signal.
--               It is used here to ensure that a lockup situation cannot
--               occur when a branch is holding up stages 1, 2 and 3.
--
-- xholdup123  U From extensions. This is used by extension ALU
--               instructions to hold up the pipeline if the function
--               requested cannot be completed on the current cycle.
--               Pipeline stages 1, 2 and 3 will be held, but the
--               writeback (stage 4) will continue.
--
-- p3_wb_req   U This signal (produced by rctl.vhd) is set true when the
--               instruction in stage 3 wants to writeback to the register
--               file, i.e.
--
--                 a. A destination register is given (r0-r60)
--                 b. A link register is to be written (interrupt, BLcc)
--                 c. LD/ST with .A specified - to do address writeback
--
--               It will be false when no destination is required, i.e.
--
--                 a. jumps/branches (not BLcc)
--                 b. instructions with dest = immediate
--                 c. instructions for which the condition is false
--                 d. LD/ST without .A specified - no address writeback
--                 e. cancelled instructions (p3iv = '0')
--                 f. extension instruction, xnwb = '1'
--
-- p3_wb_rsv   U This signal is set true when the instruction at stage 3
--               wants to reserve the writeback stage for itself. This is
--               required when a FIFO-type instruction wants to suppress
--               writeback to the register file, but needs the data and
--               register address to be present in the writeback stage so
--               that it can be picked off and sent into the FIFO buffer.
--               It is generated by rctl.vhd and will be true when an
--               extension instruction at stage 3 is suppressing writeback
--               with the xnwb signal.
--
-- cr_hostw    U This signal is set true to indicate that a host write
--               to the core registers will take place on the next cycle,
--               and that the end-of-stage 3 data and register address
--               latches should clock in the address and data provided by
--               the host. Note that host writes are overridden by returning
--               delayed loads. This signal hold_host will be asserted
--               (produced in rctl.vhd) to tell the host to wait for a
--               cycle. ***
--
-- h_pcwr      U From pcounter.vhd. This signal is set true when the host
--               is attempting to write to the pc/status register, and the
--               ARC is stopped. It is used to trigger an instruction fetch
--               when the PC is written when the ARC is stopped. This is
--               necessary to ensure the correct instruction is executed
--               when the ARC is restarted.



-- Outputs :
--
-- en1         U Stage 2 pipeline latch control. True when an instruction
--               is being latched into pipeline stage 2. Will be true
--               at different times to pcen, as it allows junk instructions
--               to be latched into the pipeline.
--
--               *** A feature of this signal is that it will allow an
--               instruction to be clocked into stage 2 even when stage 3
--               is halted, provided that stage 2 contains a killed
--               instruction (i.e. p2iv = '0'). This is called a
--               'catch-up'. ***
--
-- ifetch      U This signal, similar to pcen, indicates to the memory
--               controller that a new instruction is required, and should
--               be fetched from memory from the address which will be
--               clocked into currentpc[25:2] at the end of the cycle. It is
--               also true for one cycle when the processor has been started
--               following a reset, in order to get the ball rolling.
--               An instruction fetch will also be issued if the host changes
--               the program counter when the ARC is halted, provided it is
--               not directly after a reset.
--               The ifetch signal will never be set true whilst the memory
--               controller is in the process of doing an instruction fetch,
--               so it may be used by the memory controller as an
--               acknowledgement of instruction receipt.
--
-- ipending    U This signal is true when an instruction fetch has been
--               issued, and it has not yet completed. It is not true
--               directly after a reset before the ARC has started, as no
--               instruction fetch will have been issued. It is used to hold
--               off host writes to the program counter when the ARC is
--               halted, as these accesses will trigger an instruction fetch.
--
-- pcen        U This signal is true when the pc is allowed to change state.
--               It takes account of ivalid (stage 1 has fetched a valid
--               instruction) and the interrupts which need to be able to
--               prevent the pc from updating.
--
--               *** A feature of this signal is that it will allow an
--               instruction to be clocked into stage 2 even when stage 3
--               is halted, provided that stage 2 contains a killed
--               instruction (i.e. p2iv = '0'). This is called a
--               'catch-up'. ***
--
-- en2         U Stage 3 pipeline latch control. Controls transition of
--               instruction in stage 2 to stage 3. Will be set false if
--               the op in stage 2 requires data from stage 1 which is not
--               forthcoming because the instruction cannot be fetched to
--               stage 1 during this cycle (i.e. ivalid = '0'). This
--               condition will occur for instructions which use long
--               immediate data or for jump/branch instructions which require
--               the correct instruction to be in the delay slot.
--
-- en3         U Stage 3 instruction completion control. This signal is set
--               true to indicate that the instruction in stage 3 can
--               complete at the end of the cycle and pass out of pipeline
--               stage 3. It may or may not pass into stage 4 (the writeback
--               stage), depending on whether a writeback is required or not.
--               Taken on its own, this signal controls writeback to the
--               flags.
--
-- p3wb_en     U Stage 4 pipeline latch control. Controls transition of the
--               data on the p3result[31:0] bus, and the corresponding
--               register address from stage 3 to stage 4. As these buses
--               carry data not only from instructions but from delayed load
--               writebacks and host writes, they must be controlled
--               separately from the instruction in stage 3. This is because
--               if the instruction in stage 3 does not need to write a value
--               back into a register, and a delayed load writeback is about
--               to happen, the instruction is allowed to complete (i.e. set
--               flags) whilst the data from the load is clocked into
--               stage 4. If however the instruction in stage 3 DOES need to
--               writeback to the register file when a delayed load writeback
--               is about to happen, then the instruction in stage 3 must be
--               held up and not allowed to change the processor state,
--               whilst the data from the delayed load is clocked into
--               stage 4 from stage 3.
--
--               *** Note that p3wb_en can be true even when the processor
--               is halted, as delayed load writebacks and host writes use
--               this signal in order to access the core registers. ***
--
-- wben        L This signal is the stage 4 write enable signal. It is
--               latched from p3wb_en. Stage 4 is never held up.
--
-- p2iv        L Pipeline stage 2 instruction valid. This latched signal
--               indicates that the instruction in stage 2 is valid. When it
--               is set false, the instruction in stage 2 is either a junk
--               value clocked in to keep the pipeline running, or an
--               instruction which was killed by the interrupt system.
--
-- p3iv        L Pipeline stage 3 instruction valid. This latched signal
--               indicates that the instruction in stage 3 is valid. When it
--               is set false, the instruction in stage 3 is either a junk
--               value clocked in to keep the pipeline running, or an
--               instruction which was killed by the interrupt system, or
--               a blank slot inserted when the instruction in stage 2 was
--               not allowed to complete on the previous cycle. This blank
--               slot must be inserted otherwise the instruction which was
--               executed by stage 3 during the previous cycle will be
--               executed again during the current cycle.

-- pcen : Program counter update enable
--
-- This signal indicates to the program counter that a new value can be
-- loaded. This will be the case when:
--
--  a. A valid instruction has been fetched and can be passed on to
--     stage 2, allowing the memory controller to start looking for the
--     next instruction to be executed.
--
--     *** Note that this logic handles the case when stage 2 contains
--     an invalid instruction which is held due to stall in stage 3, and
--     we allow the instruction in stage 1 to move into stage 2. ***
--
--  b. An interrupt is in stage 2, and the interrupt vector is to be
--     clocked into the program counter. The instruction now being
--     fetched into stage 1 will be killed anyway, but we must wait
--     until it has been fetched to be sure that we do not issue a new
--     fetch request to the memory controller before the last one has
--     completed.
--     The interrupt vector should only be clocked into the program
--     counter when the interrupt can move out of stage 2. This will
--     ensure that the correct pc value will be placed in the interrupt
--     link register.
--
-- We will also want to forcibly prevent the program counter from being
-- updated in some cases:
--
--  a. An interrupt has been recognized, and we want to kill the
--     instruction currently in stage 1, and not increment the program
--     counter in order to ensure the correct PC value is stored into
--     the appropriate interrupt link register.
--
--  b. The breakpoint instruction (or valid actionpoint) is detected and
--     the pipeline is to be flushed, and then halted.
--
--  c. A single instruction step is being executed, whilst preventing
--     another ifetch from being generated in order to only execute one
--     instruction at a time. During a single instruction step the PC is
--     only allowed to be updated (thereby generating a new ifetch)
--     when:
--
--     1. a valid instruction in stage 1 is allowed to pass into
--        stage 2.
--     2. a branch or jump instruction in stage 2 has a killed
--        delay slot.
--     3. an instruction is in stage 2 that uses a long immediate.
--     4. an interrupt has been detected and is now in stage 2.
--
-- *** Note that if an invalid instruction in stage 2 is held (this will be
-- due to a stall at stage 3) then the instruction in stage 1 will be
-- allowed to move into stage 2. ***



-- Added to allow pc updates to advance ifetching
-- ip2ivalid_r prevents the core advancing more than 1 cycle
-- i_p2_fst_ifetch_r = '1' and i_fst_ifetch_r = '0' allow the core to
-- initially advance ifetch to get things 'rolling'

ipcen <= '0' WHEN en = '0'
           OR (ip2ivalid_r = '0' and not (i_p2_fst_ifetch_r = '1'
                                          and i_fst_ifetch_r = '0'))
           -- or (ip2limm = '1' and i_p2merge_valid_r = '1')
           OR (p2int = '1' AND ien2_non_iv = '0')
           OR (ip2iv = '1' AND ien2_non_iv = '0')
           OR (i_break_stage1 = '1')
           or (ip2limm = '1' and ip2killnext = '1' and
               i_p2merge_valid_r = '0')
           OR inst_stepping = '1'
           OR p1int = '1' ELSE
         '1';

-- inst_stepping : PC Disable for single instruction step
--
-- The signal inst_stepping prevents the PC from being updated, by disabling
-- the PC enable signal (pcen). The signal is set when a single instruction
-- step is being performed and the PC does not need to be updated
-- (pcen_step = '0').

inst_stepping <= '1' WHEN do_inst_step = '1'
                      AND pcen_step = '0' ELSE
                 '0';



-- The signal pcen_step is set when a single instruction step is being
-- executed if the PC needs to be updated. This happens in the following
-- cases:
--
--  a. A valid instruction in stage one is allowed to pass into
--     stage 2.
--  b. A branch or jump in stage 2 has a killed delay slot.
--  c. An instruction using long immediate is in stage 2.
--  d. An interrupt has been detected and is now in stage 2.

pcen_step <= '1' WHEN (do_inst_step = '1'
                       AND p2step = '0')
                   OR (p2step = '1'
                       AND (ip2killnext = '1'
                            OR ip2limm = '1'))
                   OR p2int = '1'
             ELSE
             '0';



-- stop_step : stop single instruction step when finished
--
-- The stop_step signal is related to single instruction step. When the
-- single instruction has been completed the stop_step signal goes high.
-- Depending on the type of instruction the stop is made in different
-- places in the pipeline:
--
--  a. Branches and jumps with delay slots that are not killed stop in
--     stage 2, because the instruction in the delay slot counts as a new
--     instruction. The next instruction step will execute the branch
--     and the delay slot.
--
--  b. All other instructions complete in stage 3 (if writeback is not
--     performed) or stage 4 (if writeback is performed).
--
-- When the stop_step signal goes high the ARC is halted,
-- the step tracker signals (below) are reset and a new instruction fetch
-- is generated.

istop_step <= '1' WHEN (ip2bch = '1'
                        AND ip2limm = '0'
                        AND ip2killnext = '0'
                        AND p2step = '1')
                    OR (p3step = '1'
                        AND ip3wb_en = '0')
              ELSE
              '0';

stop_step <= istop_step;


-- step_tracker : keeps track of the single step instruction --
--
-- The step_tracker process keeps track of where in the pipeline the
-- instruction is during single instruction step. It generates three
-- tracking signals: p1p2step, p2step and p3step. The signal p2step
-- is high when the instruction is in pipestage 2 and p3step is high
-- when the instruction is in pipestage 3. As the timing diagram shows,
-- p2step and p3step stay high after being set until
-- the cycle after the stop signal stop_step is issued, which means
-- that the instruction has completed.
--
-- Here is an example of how the step tracker process works for an
-- instruction with writeback and no long immediate. The pipeline
-- is clean before the step starts.
--
-- [Timing diagram; waveforms not recoverable from the source. Signals
--  shown: ck, do_inst_step, ien1, ivalid, p1p2step, p2step, p3step,
--  ip3wb_en, stop_step.]
--
-- The signal p1p2step is set when a valid instruction has moved from
-- stage 1 to stage 2. This signal sets p2step. But p2step is not only
-- set by p1p2step but also if there is already an instruction in
-- stage 2 that uses long immediate or has a killed delay slot or if
-- an interrupt is in stage 2 (p2int is set). This can happen if the
-- ARC was just halted after running in free-running mode. The pipeline
-- can then be filled with anything in this situation. This can only
-- happen on the first instruction step after free-running mode. On
-- the second consecutive instruction step the pipeline will be clean.

p2step <= p1p2step OR
          (do_inst_step AND (ip2limm OR ip2killnext OR p2int));

step_tracker : PROCESS (ck, clr)
BEGIN
  IF clr = '1' THEN
    p1p2step <= '0';
    p3step   <= '0';
  ELSIF (ck'EVENT AND ck = '1') THEN
    IF istop_step = '1' THEN
      p1p2step <= '0';
    ELSIF (ien1 = '1' AND (ivalid = '1' OR p1int = '1')) THEN
      p1p2step <= do_inst_step;
    END IF;

    IF istop_step = '1' THEN
      p3step <= '0';
    ELSIF ien2 = '1' THEN
      p3step <= p2step;
    END IF;
  END IF;
END PROCESS;

-- A load of signals inserted to reduce the complexity of the logic
-- minimization task for the ivalid signal

ien2_non_iv <= '0' WHEN en = '0'
                 OR ien3_non_iv = '0'
                 OR (holdup12 OR ihp2_ld_nsc) = '1'
                 OR xholdup12 = '1'
                 OR ibch_holdp2 = '1' ELSE
               '1';

ien3_non_iv <= '0' WHEN en = '0'
                 OR (xholdup123 AND xt_aluop) = '1'
                 OR mwait = '1'
                 OR ip3_load_stall = '1'
               ELSE
               '1';



-- ifetch : Tell M/C to do a fetch
--
-- This signal is used to tell the memory controller to do another
-- instruction fetch with the program counter value which will appear at
-- the end of the cycle. It is normally the same as pcen except for when
-- the processor is restarted after a reset, when an initial instruction
-- fetch request must be issued to start the ball rolling.
-- In addition, ifetch will be set true when the host is allowed to change
-- the program counter when the ARC is halted. This means that the new
-- program counter value will be passed out to the memory controller
-- correctly. The ifetch signal is not set true when there is an instruction
-- fetch still pending.
-- Signal i_awake will be true for one cycle after the processor is started
-- after a reset.

i_awake <= en AND NOT l_go;

-- Signal i_hostload will be true when a new instruction fetch needs to be
-- issued due to the host changing the program counter.


i_hostload <= '1' WHEN h_pcwr = '1'
                   AND ip2ivalid_r = '1' AND n_go = '1' else
              '0';

-- [The start of the following statement is missing from the source.]
           OR (i_break_stage1 = '1')
           OR inst_stepping = '1'
           OR p1int = '1' ELSE
         '1';

-- The ifetch signal comes from either pcen, kick-start after reset, or
-- when a fetch is required as the host has changed the program counter.

i_ifetch <= i_fetchen OR i_awake OR i_hostload;

-- ARC3 i_ifetch <= pcen OR i_awake OR i_hostload;

-- The latch is set true after the processor is started after a reset, and
-- will stay true until the next reset.
-- l_go is taken low when the instruction cache is invalidated.
-- This is in order to prevent a lockup situation.

n_go <= en OR l_go;

ni_go <= n_go AND not ivic;

lego : PROCESS (ck, clr)
BEGIN
  IF clr = '1' THEN
    l_go <= '0';
  ELSIF (ck'EVENT AND ck = '1') THEN
    l_go <= ni_go;
  END IF;
END PROCESS;
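
-- The kick-start behavior described above (i_awake true for one cycle
-- after the processor is started following a reset) can be checked in
-- isolation. The testbench below is only a sketch: the entity name
-- kickstart_tb, the stimulus timing and the report strings are
-- assumptions, and the three concurrent assignments plus the lego latch
-- are copied from the listing with ivic held at '0'.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking testbench; not part of the original source.
entity kickstart_tb is
end entity kickstart_tb;

architecture sim of kickstart_tb is
  signal ck, clr, en, ivic          : std_logic := '0';
  signal l_go, n_go, ni_go, i_awake : std_logic;
begin
  -- Logic under test, as in the listing.
  n_go    <= en OR l_go;
  ni_go   <= n_go AND NOT ivic;
  i_awake <= en AND NOT l_go;

  lego : process (ck, clr)
  begin
    if clr = '1' then
      l_go <= '0';
    elsif rising_edge(ck) then
      l_go <= ni_go;
    end if;
  end process lego;

  stim : process
  begin
    -- Reset, then start the processor.
    clr <= '1';
    wait for 10 ns;
    clr <= '0';
    en  <= '1';
    wait for 1 ns;
    assert i_awake = '1'
      report "i_awake should be high before the first clock edge"
      severity failure;

    -- One clock cycle: l_go latches high and i_awake drops.
    ck <= '1';
    wait for 5 ns;
    ck <= '0';
    wait for 5 ns;
    assert i_awake = '0'
      report "i_awake should be low once l_go has been set"
      severity failure;

    wait;  -- end of stimulus
  end process stim;
end architecture sim;
```

-- Because l_go feeds back through n_go, it stays set after the first edge
-- even if en later drops, until clr or ivic clears it.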

-- ipending : An instruction is being fetched
--
-- This signal is set true when an instruction fetch has been issued,
-- (i.e. not directly after reset) and the fetch has not yet completed,
-- signaled by ivalid = '0'.
-- It is used to prevent writes to the pc from the host from generating
-- an ifetch request when there is already an instruction fetch pending.
-- Host accesses are rejected with hold_host, generated in hostif.vhd

ipend : process (ck, clr)
begin
  IF clr = '1' THEN i_ipending <= '0';
  ELSIF ck = '1' AND ck'event THEN

    -- entry state : when ARC is started onwards
    IF i_ifetch = '1' THEN i_ipending <= '1';
    END IF;

    -- exit state : i.e. when no more fetches are required
    -- Or: An instruction cache invalidate puts us back into
    -- the immediately post-reset condition.
    -- IF (i_ifetch = '0' AND ivalid = '1')
    --   OR (ivic = '1') THEN ipending <= '0';
    -- END IF;

    -- Pending ifetches in the core are cancelled when an ivalid
    -- returns from the cache or an invalidate is requested
    IF ( -- i_ifetch = '0' AND
         ivalid = '1')
       OR (ivic = '1') THEN i_ipending <= '0';
    END IF;

  END IF;
END PROCESS;



-- en1 : Pipeline 1 -> 2 transition enable
--
-- This signal is true at all times when the processor is running except
-- when:
--
--  a. A valid instruction in stage 2 cannot complete for some reason, or
--     if an interrupt in stage 2 is waiting for a pending instruction fetch
--     to complete.
--
--  b. A breakpoint instruction (or valid actionpoint) is detected and
--     stage 2 has to be halted, while the remaining stages are flushed, and
--     then halted.
--
--  c. The single instruction has already moved on to stage 2 and this
--     instruction does not depend on the following instruction.
--     This is a special case that only happens during single instruction
--     step. Because single instruction step finishes the instruction that
--     was in pipeline stage 1, this is actually the starting mechanism of
--     the single instruction stepping. The next instruction is not allowed
--     to pass on until the instruction further down the pipe has
--     completed and not until a new single instruction step command
--     has been generated.
--
-- *** Note that if an invalid instruction in stage 2 is held (this will be
-- due to a stall at stage 3) then the instruction in stage 1 will be
-- allowed to move into stage 2. ***
--
-- An additional disable flag is added to cope with ifetch stalling and
-- the cache keeping ivalid high even though the instruction in stage 1
-- has moved to stage 2 or further.

ien1 <= '0' WHEN en = '0'
          OR (p2step = '1' AND ip2p1dep = '0' AND p2int = '0')
          OR (i_break_stage1 = '1')
          OR (p2int = '1' AND ien2 = '0')
          OR (ip2iv = '1' AND ien2 = '0')
          or i_p1_used_r = '1'
          -- or (i_ifetch_r = '0' and i_ipending = '0')
          -- or (i_ipending = '0' and i_awake = '0')
        ELSE
        '1';

-- The signal ien1_lowpower (below) is almost always equal to ien1 (above),
-- except when the opcode is not valid. This prevents invalid opcodes from
-- propagating to pipeline stage 2 and thereby power is saved.
-- This is ONLY used to enable the p2iw latch. The global EN1 stays as normal.
-- The ivalid signal is also used in sync_regs to switch off RAM reads when
-- the new instruction is not valid.

ien1_lowpower <= '0' WHEN (ivalid = '0') ELSE
                 ien1;



-- en2 : Pipeline 2 -> 3 transition enable
--
-- This signal is true when the processor is running, and the instruction
-- in stage 2 can be allowed to move on into stage 3. It may be held up for
-- a number of reasons:
--
-- a. A register referenced by the instruction is currently the subject
--    of a pending delayed load (holdup12 from the scoreboard unit).
--
-- b. Stage 2 contains an instruction which requires a long immediate
--    data value from stage 1 which cannot be fetched on this cycle
--    (ip2limm = '1').
--
-- c. Stage 2 contains a jump/branch instruction, which requires that the
--    correct instruction be present in the delay slot following the
--    jump/branch instruction (ip2bch = '1', ivalid = '0').
--
-- d. An interrupt in stage 2 is waiting for a pending instruction
--    fetch to complete before issuing the fetch from the interrupt
--    vector.
--
-- e. A valid instruction in stage 3 is held up for some reason.
--    Note that stage 3 will never be held up if it does not contain
--    a valid instruction.
--
-- f. Extensions require that stage 2 be held up, probably due to a
--    register not being available for a read on this cycle.
--
-- g. The branch protection system detects that an instruction setting
--    flags is in stage 3, and a dependent branch is in stage 2. Stage 2
--    is held until the instruction in stage 3 has completed.
--
-- h. The opcode is not valid (ip2iv = '0') and this is not due to an
--    interrupt (i.e. p2int = '0'). This is done to reduce power
--    consumption.
--
-- i. The actionpoint debug mechanism or the breakpoint instruction
--    is triggered and thus disables the instructions from
--    going into stage 3 when the instruction in stage 1 is the delay slot
--    of a branch/jump instruction.
--
-- j. A branch/jump with a delay slot that is not killed is in stage 2
--    during single instruction step.



-- All ivalid have been changed to p2ivalid_r so the processor doesn't
-- get more than one cycle ahead.
-- Additionally, instructions in stage 2 referencing a limm can only move
-- after the limm is merged.
-- Also stage 2 is stalled when p1 has been used ...

ien2 <= '0' WHEN en = '0'
          OR ien3 = '0'
          OR (holdup12 OR ihp2_ld_nsc) = '1'
          OR xholdup12 = '1'
          OR (i_p2merge_valid_r = '0' AND p2limm = '1')
          OR (p2int = '1' AND ip2ivalid_r = '0')
          OR (ip2bch = '1' AND ip2ivalid_r = '0')
          OR (ip2limm = '1' AND ip2ivalid_r = '0')
          OR ibch_holdp2 = '1'
          OR (ip2iv = '0' AND p2int = '0')
          OR (i_break_stage2 = '1')
          OR i_p1_used_r = '1'
          OR (p1p2step = '1' AND ip2bch = '1'
              AND ip2limm = '0' AND ip2killnext = '0')
        ELSE
        '1';
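To make the stall terms of the stage 2 enable easier to follow, the equation can be modelled behaviourally. The sketch below is in Python rather than VHDL and is illustrative only. The field names follow the signals in the listing, with ambiguous characters read as digits (for example `holdup12`, `p1p2step`), which is an assumption. The class `Stage2Inputs`, the function `ien2_stall_reasons` and the reason strings are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class Stage2Inputs:
    # Defaults describe a running pipeline with nothing holding stage 2.
    en: bool = True
    ien3: bool = True
    holdup12: bool = False
    ihp2_ld_nsc: bool = False
    xholdup12: bool = False
    i_p2merge_valid_r: bool = True
    p2limm: bool = False
    p2int: bool = False
    ip2ivalid_r: bool = True
    ip2bch: bool = False
    ip2limm: bool = False
    ibch_holdp2: bool = False
    ip2iv: bool = True
    i_break_stage2: bool = False
    i_p1_used_r: bool = False
    p1p2step: bool = False
    ip2killnext: bool = False


def ien2_stall_reasons(s: Stage2Inputs) -> list:
    """Return a label for every term that would force ien2 to '0'."""
    terms = [
        ("processor not running", not s.en),
        ("stage 3 held", not s.ien3),
        ("pending delayed load", s.holdup12 or s.ihp2_ld_nsc),
        ("extension hold", s.xholdup12),
        ("long immediate not merged", s.p2limm and not s.i_p2merge_valid_r),
        ("interrupt waiting for fetch", s.p2int and not s.ip2ivalid_r),
        ("delay slot not fetched", s.ip2bch and not s.ip2ivalid_r),
        ("long immediate not fetched", s.ip2limm and not s.ip2ivalid_r),
        ("branch protection", s.ibch_holdp2),
        ("invalid opcode", not s.ip2iv and not s.p2int),
        ("breakpoint/actionpoint", s.i_break_stage2),
        ("stage 1 already used", s.i_p1_used_r),
        ("single step with live delay slot",
         s.p1p2step and s.ip2bch and not s.ip2limm and not s.ip2killnext),
    ]
    return [name for name, active in terms if active]


def ien2(s: Stage2Inputs) -> bool:
    """Stage 2 -> 3 enable: true only when no stall term is active."""
    return not ien2_stall_reasons(s)
```

For example, `ien2_stall_reasons(Stage2Inputs(p2limm=True, i_p2merge_valid_r=False))` reports only the unmerged long immediate, which is the case that keeps the core from moving on between an opcode and its long immediate word.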



-- ibch_holdp2 : Branch protection system
--
-- In order to reduce code size, we want to remove the need to have a NOP
-- between setting the flags and taking the associated branch.
--
-- e.g.
--        sub.f   0,r0,23         ; is r0=23?
--        nop                     ; padding instruction.
--        bz      r0_is_23
--
-- In order that the compiler does not have to generate these instructions,
-- we can generate a stage 2 stall if an instruction in stage 3 is attempting
-- to set the flags. Once this instruction has completed, and has passed out
-- of stage 3, then stage 2 will continue.

-- We need to detect the following types of valid instruction at stage 3:
--
--   i.   Any ALU instruction which sets the flags (p3setflags)
--   ii.  Jcc.F or JLcc.F
--   iii. A FLAG instruction.

ibch_p3flagset <= ip3iv WHEN (ip3setflags = '1')
                          OR ((ip3i = ojcc) AND (ip3_fbit = '1'))    -- Jcc/JLcc
                          OR ((ip3i = oflag) AND (ip3c = so_flag))   -- FLAG
                        ELSE
                  '0';

-- In order to generate the stall, we also need to detect a valid branch
-- instruction present in stage 2 (ip2bch).
--
-- We generate a stall when the two conditions are present together:
--
--   a. An instruction in stage 3 is attempting to set the flags
--   b. A branch instruction at stage 2 needs to use these new flags
--
-- Note that it would be possible to detect the following conditions to give
-- theoretical improvements in performance. These are very marginal, and have
-- been left out here for the sake of simplicity, and the fact it would be
-- difficult for the compiler to take advantage of these optimizations.
-- Both cases remove the link between setting the flags and the following
-- branch, either because the flags don't get set, or because the branch
-- doesn't check the flags.
--
--   i.  Conditional flag set instruction at stage 3 does not set flags,
--       e.g. add.cc.f r0,r0,r0, resulting in C=1
--   ii. Branch at stage 2 uses the AL (always) condition code.

ibch_holdp2 <= '1' WHEN (ibch_p3flagset = '1')      -- p3 setting flags
                    AND (ip2bch = '1') ELSE         -- branch in p2
               '0';
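The branch protection rule can be illustrated with a short behavioural model, in Python rather than VHDL. It is a sketch under two assumptions that are not stated in the listing: every instruction in the example is valid, and a flag-setting instruction spends exactly one cycle in stage 3. The function `count_branch_stalls` and the tuple format for a program are hypothetical.

```python
def ibch_p3flagset(ip3iv, ip3setflags, jcc_with_f=False, flag_instr=False):
    """Valid stage 3 instruction that sets the flags (ALU .F, Jcc.F/JLcc.F, FLAG)."""
    return bool(ip3iv and (ip3setflags or jcc_with_f or flag_instr))


def ibch_holdp2(p3flagset, ip2bch):
    """Hold stage 2 while stage 3 sets flags and a branch waits in stage 2."""
    return bool(p3flagset and ip2bch)


def count_branch_stalls(program):
    """Count one-cycle stage 2 holds for a straight-line instruction sequence.

    program is a list of (mnemonic, sets_flags, is_branch) tuples. When an
    instruction is in stage 2, the one before it is assumed to be in stage 3.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        _, prev_sets_flags, _ = prev
        _, _, cur_is_branch = cur
        if ibch_holdp2(ibch_p3flagset(True, prev_sets_flags), cur_is_branch):
            stalls += 1
    return stalls
```

Under these assumptions, the sequence `sub.f` followed directly by `bz` costs one hold cycle, while the same sequence padded with a `nop` costs none. The cycle count is the same in both cases, but the padded version occupies one more instruction word, which is the code size saving the comments describe.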
-- en3 : Stage 3 instruction completion control
--
-- This signal is true when the processor is running, and the instruction
-- in stage 3 can be allowed to complete and set the flags if appropriate.
--
-- Stage 3 may be prevented from completing for a number of reasons:
--
-- a. An extension multi-cycle ALU operation has requested extra time
--    to complete the operation (xholdup123). Note that this can only
--    be the case when extension ALU operations are enabled with the
--    xt_aluop constant in extutil.vhd.
--
-- b. The memory controller is busy and cannot accept any more load or
--    store operations (mwait).
--
-- c. Deleted in v6.

-- p2iv : Stage 2 instruction valid
--
-- This signal indicates that stage 2 contains a valid instruction. The
-- instruction in stage 2 may not be valid for a number of reasons:
--
-- a. A breakpoint/actionpoint has been detected, and instructions in
--    stage two are to be invalidated for when the ARC is to be
--    restarted.
--
-- b. The correct instruction word could not be fetched in time, so
--    a junk instruction is inserted into the pipeline to keep it
--    flowing.
--
-- c. An interrupt was recognized, causing the instruction which was in
--    stage 1 (valid or not) to be killed.
--
-- d. The interrupt which was recognized, and which is now in stage 2,
--    requires a blank delay slot to perform the jump to the interrupt
--    vector. The instruction in stage 1 is therefore killed.
--
-- e. A long immediate data value was required by the previous
--    instruction, and is killed to prevent it being executed as a
--    real instruction.
--
-- f. The delay slot mechanism of a jump/branch instruction in stage 2
--    has decided that the following instruction should be killed.
--    Note that this instruction must be present in stage 1 in order
--    to be killed, before the pipeline can be moved on. This is
--    handled by the en2 signal.
--
-- g. The single instruction in stage 2 will move on to stage 3 the
--    next cycle. This is a special case which only occurs during
--    a single instruction step. This must be done to prevent the
--    instruction from being executed repeatedly in stage 2. The
--    reason this does not kill instructions with long immediates
--    or delay slots is because of the signal ien2. The signal ien2
--    is not set when there is an instruction in stage 2 that uses a
--    long immediate or delay slot in stage 1 in this situation. The
--    reason is that stage 2 stalls while another fetch is being made
--    in order to get the LIMM/delay slot.
--
-- The appropriate value is latched into p2iv when the instruction in
-- stage 1 is allowed to move into stage 2.
--
-- Jumps can move independently of delay slot instructions in this case;
-- delay slots which need to be killed are killed by the pending kill logic
-- (i_pending_kill_r).

n_p2iv <= '0' WHEN ((i_break_stage1 = '1' AND
                     i_break_stage2 = '0') AND ien2 = '1') OR
                   (p2step = '1' AND ien2 = '1') ELSE
          ip2iv WHEN ien1 = '0' ELSE
          '0' WHEN (p1int OR p2int) = '1'
                OR ip2limm = '1'
                OR ip2killnext = '1'
                OR i_pending_kill_r = '1'
              ELSE
          ivalid;

p2ivreg : PROCESS (ck, clr)
BEGIN
    IF clr = '1' THEN
        ip2iv <= '0';
    ELSIF (ck'EVENT AND ck = '1') THEN
        ip2iv <= n_p2iv;
    END IF;
END PROCESS;

-- p3iv : Stage 3 instruction valid
--
-- This signal indicates that stage 3 contains a valid instruction. The
-- instruction in stage 3 may not be valid for a number of reasons:
--
-- a. The instruction was marked as invalid when it moved into stage 2,
--    i.e. p2iv = '0'.
--
-- b. The instruction in stage 2 has not been able to complete for some
--    reason, and the instruction in stage 3 has been able to complete
--    and will move on at the end of the cycle. It is thus necessary
--    to insert a blank slot into stage 3 to fill in the gap. If this
--    is not done, the instruction which was in stage 3 will be
--    executed again, and this would of course be *bad news*.

n_p3iv <= ip3iv WHEN ien3 = '0' ELSE
          '0' WHEN ien2 = '0' AND ien3 = '1' ELSE
          ip2iv;

p3ivreg : PROCESS (ck, clr)
BEGIN
    IF clr = '1' THEN
        ip3iv <= '0';
    ELSIF (ck'EVENT AND ck = '1') THEN
        ip3iv <= n_p3iv;
    END IF;
END PROCESS;
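The bubble insertion performed by the stage 3 valid logic can be checked with a small behavioural model, written in Python rather than VHDL purely for illustration. The function name `next_p3iv` is hypothetical; the three branches follow the conditional assignment in the listing in the same priority order.

```python
def next_p3iv(ip3iv: bool, ip2iv: bool, ien2: bool, ien3: bool) -> bool:
    """Value latched into the stage 3 valid bit on the next clock edge."""
    if not ien3:
        # Stage 3 is held: its instruction, and its valid bit, stay put.
        return ip3iv
    if not ien2:
        # Stage 3 moves on but stage 2 cannot: insert a blank slot so the
        # stage 3 instruction is not executed a second time.
        return False
    # Both stages advance: stage 2's valid bit follows its instruction.
    return ip2iv
```

The middle branch is the case the comment warns about: without it, a stalled stage 2 would leave the completed stage 3 instruction marked valid for another cycle.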

-- Disable Logic to Stall the ARC for Actionpoint System
--
-- The pipeline flushing mechanism has been introduced to support the
-- breakpoint instruction and actionpoint hardware. Each stage of the
-- pipeline is stalled explicitly, and once stages one, two and three
-- have all been stalled the ARC is stalled via the en bit.
--
-- This signal is true when both of the following conditions are true:
--
-- a. The instruction in stage one should be killed when it advances
--    into stage two.
--
-- b. The actionpoint mechanism was set by the hardware breakpoint
--    alone.

i_kill_AP <= ip2killnext AND hw_brk_only;

-- The stalling signal for stalling en1 is defined by i_break_stage1, and
-- this is set to '1' on the following conditions:
--
-- a. The breakpoint instruction has been detected at stage one, i.e.
--    i_brk_inst = '1', or an actionpoint has been triggered by a valid
--    signal from the OR-plane.
--
-- b. The instruction in stage one of the pipeline is to be executed,
--    and not killed.
--
-- c. The sleep instruction has been detected in stage 2.
--
-- d. The ARC is sleeping already (sleeping = '1') due to a sleep
--    instruction that was encountered earlier.

i_break_stage1 <= '1' WHEN i_brk_inst = '1'
                        OR ip2sleep_inst = '1'
                        OR sleeping = '1'
                        OR (actionhalt = '1'
                            AND i_kill_AP = '0') ELSE
                  '0';



-- The stalling signal for stalling en2 is defined by i_break_stage2, and
-- this is set to '1' on the following conditions:
--
-- a. A breakpoint/actionpoint instruction has been detected at stage
--    one, i.e. i_brk_inst = '1'. For example, an actionpoint has
--    been triggered by a valid signal from the OR-plane.
--    This has to be true when the instruction
--    in stage one is in the delay slot of a branch, jump or loop
--    instruction. It can also be long immediate data.
--
-- b. A breakpoint/actionpoint instruction has been detected at stage
--    one, i.e. i_break_stage1 = '1'. For example, an actionpoint has
--    been triggered by a valid signal from the OR-plane.
--    This has to be true when there is an interrupt in
--    stage two, i.e. p2int = '1'.

i_break_stage2 <= '1' WHEN ((ip2p1dep = '1' OR p2int = '1')
                        AND (i_break_stage1 = '1')) ELSE
                  '0';
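The relationship between the two break signals can be sketched as a behavioural model, in Python rather than VHDL and for illustration only. The parameter names follow the listing, with the dependency signal read as `ip2p1dep` (an assumption about the digit); the function names simply reuse the signal names.

```python
def i_break_stage1(i_brk_inst, ip2sleep_inst, sleeping, actionhalt, i_kill_ap):
    """Stall request for stage 1: breakpoint, sleep, or an unkilled actionpoint."""
    return bool(i_brk_inst or ip2sleep_inst or sleeping
                or (actionhalt and not i_kill_ap))


def i_break_stage2(break_stage1, ip2p1dep, p2int):
    """Stall request for stage 2: only when stage 2 depends on stage 1
    (delay slot or long immediate) or holds an interrupt."""
    return bool((ip2p1dep or p2int) and break_stage1)
```

In this model a breakpoint on an ordinary instruction stalls stage 1 alone, so stage 2 can drain. When stage 2 depends on the word in stage 1, both stall together, which keeps an instruction from being separated from its delay slot or long immediate data.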

-- As the pipeline is flushed of instructions when the breakpoint instruction
-- or a valid actionpoint is detected, it is important to disable each stage
-- explicitly. These signals have to follow the last instruction which is
-- allowed to complete. A normal instruction in stage one will mean that
-- instructions in stage two, three and four will be allowed to complete.
-- However, an instruction in stage one which is in the delay slot of a
-- branch or jump instruction means that stage two has to be stalled as well.
-- Therefore, only stages three and four will be allowed to complete.
--
-- The qualifying valid signal for stage two is defined by i_n_AP_p2disable,
-- and this is set to '1' on the following conditions:
--
-- a. There is an instruction in stage two which has a dependency in
--    stage one, i.e. i_break_stage2 = '1'.
--
-- b. The breakpoint instruction or actionpoint has been detected, i.e.
--    i_break_stage1 = '1', and the instruction in stage two is enabled,
--    i.e. ien2 = '1', and the instruction is allowed to move on.
--
-- c. The breakpoint instruction or actionpoint has been detected, i.e.
--    i_break_stage1 = '1', and the instruction in stage two is invalid,
--    i.e. ip2iv = '0'.

i_n_AP_p2disable <= '1' WHEN i_break_stage2 = '1' OR
                             (i_break_stage1 = '1' AND
                              ((ien2 = '1' AND ip2iv = '1') OR
                               ip2iv = '0'))
                        ELSE
                    '0';


-- The qualifying valid signal for stage three is defined by
-- i_n_AP_p3disable, and this is set to '1' on the following conditions:
--
-- a. The instruction in stage two is invalid, i_AP_p2disable_r = '1'.
--    Also the instruction in stage three is enabled, en3 = '1', and
--    the instruction is allowed to move on.
--
-- b. The instruction in stage two is invalid, i_AP_p2disable_r = '1'.
--    Also the instruction in stage three is invalid, ip3iv = '0'.

i_n_AP_p3disable <= '0' WHEN i_break_stage1 = '0' ELSE
                    '1' WHEN (i_AP_p2disable_r = '1') AND
                             ((ien3 = '1' AND ip3iv = '1') OR
                              ip3iv = '0') ELSE
                    '0';



update_AP_disable : PROCESS (ck, clr)
BEGIN
    IF clr = '1' THEN
        i_AP_p2disable_r <= '0';
        i_AP_p3disable_r <= '0';
    ELSIF (ck'EVENT AND ck = '1') THEN
        i_AP_p2disable_r <= i_n_AP_p2disable;
        i_AP_p3disable_r <= i_n_AP_p3disable;
    END IF;
END PROCESS;

-- Output drives for halting the ARC
AP_p3disable_r <= i_AP_p3disable_r;

END synthesis;

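The staged disabling described in the listing (stage two first, stage three one clock later) can be reproduced with a short behavioural model. It is written in Python rather than VHDL and is a sketch only: the function `flush_sequence` is hypothetical, the final ELSE value of the stage two equation is assumed to be '0', and the pipeline inputs are held constant for the whole run, which a real core would not do.

```python
def next_p2disable(break1, break2, ien2, ip2iv):
    """Next value of i_AP_p2disable_r (final ELSE assumed to be '0')."""
    return bool(break2 or (break1 and ((ien2 and ip2iv) or not ip2iv)))


def next_p3disable(break1, p2disable_r, ien3, ip3iv):
    """Next value of i_AP_p3disable_r."""
    if not break1:
        return False
    return bool(p2disable_r and ((ien3 and ip3iv) or not ip3iv))


def flush_sequence(cycles, break1=True, break2=False,
                   ien2=True, ip2iv=True, ien3=True, ip3iv=True):
    """Clock both registers together and record their values per cycle."""
    p2dis_r = p3dis_r = False          # state after clr
    history = []
    for _ in range(cycles):
        n_p2 = next_p2disable(break1, break2, ien2, ip2iv)
        n_p3 = next_p3disable(break1, p2dis_r, ien3, ip3iv)
        p2dis_r, p3dis_r = n_p2, n_p3  # both update on the same clock edge
        history.append((p2dis_r, p3dis_r))
    return history
```

With a breakpoint asserted and both later stages free to drain, the model disables stage two on the first clock and stage three on the second, matching the description that the disable signals follow the last instruction allowed to complete.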
WE CLAIM:

1. A method for avoiding the stalling of long immediate data instructions in a
pipelined digital processor core having at least fetch, decode, and execution stages,
comprising:

identifying, within said pipeline, at least one instruction containing long
immediate values;

determining whether said at least one instruction has merged when said at least
one instruction is in said decode stage of said pipeline; and

preventing said core from halting before said at least one instruction has merged.

2. The method of Claim 1, wherein said act of determining comprises examining
merge logic operatively coupled to said decode stage of said core to determine if a valid merge
signal is present.

3. The method of Claim 2, wherein said act of identifying at least one instruction
comprises identifying an instruction selected from the group comprising (i) load immediate
instructions, and (ii) jump instructions.

4. A digital processor core, comprising:

an instruction pipeline having a plurality of stages;

an instruction set having at least one instruction with multiple word long
immediate values associated therewith;

core logic adapted to selectively treat said at least one instruction with said
multi-word long immediate values as a single instruction word, said core logic
preventing stalling of said core before processing of said at least one instruction has
completed.

5. The core of Claim 4, wherein said at least one instruction comprises an opcode
and immediate data, said opcode and immediate data having at least one boundary
therebetween, and said core is prevented from stalling on said at least one boundary.

6. The core of Claim 5, wherein said instruction set further comprises a base
instruction set and at least one extension instruction, said extension instruction being adapted
to perform at least one function not defined within said base instruction set.

7. The core of Claim 6, further comprising extension logic adapted to execute
said at least one extension instruction.

8. A method of reducing pipeline delays within a pipelined processor,
comprising:

providing a first instruction word;

providing a second instruction word; and

defining a single large instruction word comprising said first and second
instruction words;

processing said single large word as a single instruction within said
processor, thereby preventing stalling of the pipeline upon execution of said first
and second instruction words.

9. The method of Claim 8, wherein the acts of providing said first and second
instruction words comprises providing an instruction having at least one long immediate
value.

10. The method of Claim 9, wherein the act of providing said instruction having
said at least one long immediate value comprises providing an instruction opcode within
said first instruction word, and said at least one long immediate value within said second
instruction word.

11. The method of Claim 9, wherein the act of processing comprises:

determining whether said first and second instruction words have merged
within said pipeline; and

if said first and second words have not merged, preventing said pipeline
from stalling on the boundary between said first and said instruction words.

12. A pipelined digital processor, comprising:

a pipeline having instruction fetch, decode, execute, and writeback stages;

a program memory adapted to store a plurality of instructions at addresses
therein;

a program counter adapted to provide at least one value corresponding to
at least one of said addresses in said memory;

decode logic associated with said decode stage of said pipeline;

an instruction set comprising a plurality of instructions, said plurality further
comprising at least one breakpoint instruction; and

a program comprising a predetermined sequence of at least a portion of said
plurality of instructions, and including said at least one breakpoint instruction, said
program being stored at least in part in said program memory;

wherein the decode of said at least one breakpoint instruction during
execution of said program occurs after said instruction fetch stage using said decode
logic, and wherein said program counter is reset back to the memory address
value associated with said breakpoint instruction after said breakpoint instruction is
decoded.

13. The processor of Claim 12, further comprising an extension logic unit
adapted to execute one or more extension instructions.

14. The processor of Claim 13, wherein said instruction set further comprises at
least one extension instruction, said at least one extension instruction adapted to perform a
predetermined function upon execution within said extension logic unit.

15. A method of debugging a digital processor having a multi-stage pipeline
with fetch, decode, execute, and writeback stages, a program memory, a program counter
adapted to provide at least one address within said memory, and an instruction set stored at
least in part within said program memory, said instruction set including at least one
breakpoint instruction, comprising:

providing a program comprising at least a portion of said instruction set and
at least one breakpoint instruction;

running said program on said processor;

decoding said at least one breakpoint instruction during program execution
at said decode stage of the pipeline;

executing the breakpoint instruction in order to halt operation of said
processor;

resetting said program counter to the memory address value associated with
said breakpoint instruction; and

debugging said processor at least in part while said processor is halted.

16. The method of Claim 15, wherein said instruction set includes at least one
extension instruction, said at least one extension instruction adapted to perform a
predetermined function upon execution within said processor, said act of providing a
program further comprises providing said at least one extension instruction therein, said
method further comprising executing said at least one extension instruction during said
debugging.

17. A method of enhancing the performance of a digital processor design, said
processor design having a multi-stage instruction pipeline including at least instruction
fetch, decode, and execution stages, an instruction set having at least one breakpoint
instruction associated therewith, a program memory, and a program counter controlled at
least in part by pipeline control logic, the method comprising:

providing a program comprising at least a portion of said instruction set, said
at least portion including said breakpoint instruction;

simulating the operation of said processor using said program;

identifying a first critical path within the processing of said program based at
least in part on said act of simulating, said critical path including the processing of
said breakpoint instruction within said program; and

modifying said design to decode said breakpoint instruction within said
decode stage of said pipeline so as to reduce processing delays associated with said
first critical path.

18. The method of Claim 17, wherein the act of modifying further comprises
adapting said pipeline control logic so that said program counter resets to the memory
address value associated with said breakpoint instruction after said breakpoint instruction is
decoded within said decode stage.

19. A method of reducing pipeline delays within the pipeline of a digital
processor, comprising:

providing a first register having a plurality of operating modes;

defining a bypass mode for said first register, wherein during operation in
said bypass mode, said register maintains the result of a first multi-cycle operation
therein;

performing a first multi-cycle operation to produce a first result;

storing said first result of said first operation in said first register using said
bypass mode;

obtaining said first result of said first operation directly from said register;
and

performing a second multi-cycle operation using at least said first result of
said first operation, said second operation producing a second result.

20. The method of Claim 19, wherein said multi-cycle operation comprises an
iterative scalar calculation, said method further comprising performing the acts of storing,
obtaining, and performing for said second result of said second operation, and a plurality of
subsequent results from respective subsequent operations, wherein the result of a given
operation is stored in said first register using said bypass mode, and subsequently obtained
from said register for use in the next subsequent iteration of said calculation.
21. A processor core, comprising:

a multi-stage instruction pipeline having at least fetch, decode, and execute
stages;

an instruction set having at least one multi-cycle instruction and at least one
other instruction subsequent thereto; and

a first register disposed within the execute stage of said pipeline, said first
register having a bypass mode associated therewith, said bypass mode adapted to:

(i) retain at least a portion of the result of the execution of
said at least one multi-cycle instruction within said execute
stage; and

(ii) present said result to said at least one other instruction for
use thereby.

22. The processor core of Claim 21, wherein said first register is further adapted
to latch source operands to permit fully static operation.

23. The processor core of Claim 21, wherein said at least one multi-cycle
instruction comprises two sequential data words, the first of said data words comprising at
least opcode, and the second of said data words comprising at least one operand.

24. The processor core of Claim 23, further comprising core logic adapted to
selectively treat said at least one multi-cycle instruction with said data words as a single
instruction word, said core logic preventing stalling of said core before processing of said at
least one instruction has completed.

25. The processor core of Claim 23, wherein said instruction set further
comprises at least one extension instruction, said at least one extension instruction being
adapted to perform a predetermined function upon execution thereof by said core.

26. The processor core of Claim 25, further comprising an extension logic unit
adapted to execute said at least one extension instruction.

27. A method of operating a data cache within a pipelined processor, said pipeline
comprising a plurality of stages including at least decode and execute stages, at least one
execution unit within said execute stage, and pipeline control logic, said method comprising:

providing a plurality of instruction words;

introducing said plurality of instruction words within said stages of said
pipeline successively;

allowing said instruction words to advance one stage ahead of the data word
within said data cache;

examining the status of said data cache; and

stalling said pipeline using said control logic only when a data word required
by said at least one execution unit is not present within said data cache.

28. The method of Claim 27, further comprising:

making said data word available to said execution unit; and

updating the operand for the instruction in the stage prior to said execute stage.

29. The method of Claim 28, wherein the act of updating comprises updating the
operand in the decode stage of said pipeline.

30. A method of synthesizing the design of an integrated circuit, said design
including a pipelined processor having optimized pipeline performance:

providing input regarding the configuration of said design, said
configuration including at least one optimized pipeline architectural function;

providing at least one library of functions, said at least one library
comprising descriptions of functions including that of said at least one pipeline
architectural function;

creating a functional description of said design based on said input and said
at least one library of functions;

determining a design hierarchy based on said input and at least one library;

generating structural HDL and a script associated therewith;

running said script to create a synthesis script; and

synthesizing said design using synthesis script.

31. The method of Claim 30, wherein the act of providing input regarding the
said at least one optimized pipeline architectural function comprises:

describing at least one multi-word instruction comprising a first opcode
word and a second data word; and

specifying that said instruction is non-stallable on the boundary between said
first and second words during execution thereof.

32. The method of Claim 30, wherein the act of providing input regarding the
said at least one optimized pipeline architectural function comprises:

describing a multi-function register disposed within said pipeline, said
register adapted to store the results of the execution of a multi-cycle instruction
word within the execute stage of said pipeline; and

specifying that said result be provided to at least one instruction subsequent to
said multi-cycle instruction within said pipeline during operation.

33. The method of Claim 30, wherein the act of providing input regarding the
said at least one optimized pipeline architectural function comprises:

describing pipeline control logic adapted to control the operation of said
pipeline;

describing at least one execution unit within the execution stage of said
pipeline;

describing at least one data cache structure within said design;

specifying that said pipeline control logic be at least partly decoupled from
said data cache, thereby allowing the processing of a given instruction within said
pipeline to proceed ahead of said data cache; and

further specifying that said pipeline control logic halt said pipeline if a data
word required by said at least one execution unit is not present within said data
cache.
(54) Title: METHOD AND APPARATUS FOR ENHANCING THE PERFORMANCE OF A PIPELINED DATA PROCESSOR

(57) Abstract: A method and apparatus for enhancing the performance of a multi-stage pipeline in a digital processor. In one
aspect, the stalling of multi-word (e.g. long immediate data) instructions on the word boundary is prevented by defining oversized
or "atomic" instructions within the instruction set, thereby also preventing incomplete data fetch operations. In another aspect, the
invention comprises delayed decode of breakpoint instructions within the core so as to remove critical path restrictions in the pipeline.
In yet another aspect, the invention comprises a multi-function register disposed in the pipeline logic, the register including a bypass
mode adapted to selectively bypass or "shortcut" subsequent logic, and return the result of a multi-cycle operation directly to a
subsequent instruction requiring the result. Improved data cache integration and operation techniques, and apparatus for synthesizing
logic implementing the aforementioned methodology are also disclosed.



